Abstract


Today’s Internet provides one simple service: best effort datagram delivery. This minimalist service
allows the Internet to be stateless, that is, routers do not need to maintain any fine grained information
about traffic. As a result of this stateless architecture, the Internet is both highly scalable and
robust. However, as the Internet evolves into a global commercial infrastructure that is expected
to  support  a  plethora  of  new  applications  such  as  IP  telephony,  interactive  TV,  and  e-commerce, 
the  existing  best  effort  service  will  no  longer  be  sufficient.  In  consequence,  there  is  an  urgent  need 
to  provide  more  powerful  services  such  as  guaranteed  services,  differentiated  services,  and  flow 
protection. 

Over  the  past  decade,  there  has  been  intense  research  toward  achieving  this  goal.  Two  classes  of 
solutions  have  been  proposed:  those  maintaining  the  stateless  property  of  the  original  Internet  (e.g., 
Differentiated  Services),  and  those  requiring  a  new  stateful  architecture  (e.g.,  Integrated  Services). 
While  stateful  solutions  can  provide  more  powerful  and  flexible  services  such  as  per  flow  bandwidth 
and  delay  guarantees,  they  are  less  scalable  than  stateless  solutions.  In  particular,  stateful  solutions 
require  each  router  to  maintain  and  manage  per  flow  state  on  the  control  path,  and  to  perform  per 
flow  classification,  scheduling,  and  buffer  management  on  the  data  path.  Since  today’s  routers  can 
handle  millions  of  active  flows,  it  is  difficult,  if  not  impossible,  to  implement  such  solutions  in  a 
scalable  fashion.  On  the  other  hand,  while  stateless  solutions  are  much  more  scalable,  they  offer 
weaker  services. 

The  key  contribution  of  this  dissertation  is  to  bridge  this  long-standing  gap  between  stateless 
and stateful solutions in packet switched networks such as the Internet. Our thesis is that “it is
actually possible to provide services as powerful and as flexible as the ones implemented by a
stateful network using a stateless network”. To prove this thesis, we propose a novel technique
called  Dynamic  Packet  State  (DPS).  The  key  idea  behind  DPS  is  that,  instead  of  having  routers 
maintain  per  flow  state,  packets  carry  the  state.  In  this  way,  routers  are  still  able  to  process  packets 
on  a  per  flow  basis,  despite  the  fact  that  they  do  not  maintain  any  per  flow  state.  Based  on  DPS, 
we  develop  a  network  architecture  called  Stateless  Core  (SCORE)  in  which  core  routers  do  not 
maintain  any  per  flow  state.  Yet,  using  DPS  to  coordinate  actions  of  edge  and  core  routers  along 
the  path  traversed  by  a  flow  allows  us  to  design  distributed  algorithms  that  emulate  the  behavior  of 
a  broad  class  of  stateful  networks  in  SCORE  networks. 

In this dissertation we describe complete solutions, including architectures, algorithms, and implementations,
which address three of the most important problems in today’s Internet: providing
guaranteed services, differentiated services, and flow protection. Compared to existing solutions,
our solutions eliminate the most complex operations on both the data and control paths in the network
core, i.e., packet classification on the data path, and maintaining per flow state consistency
on  the  control  path.  In  addition,  the  complexities  of  buffer  management  and  packet  scheduling 
are  greatly  reduced.  For  example,  in  our  flow  protection  solution  these  operations  take  constant 
time,  while  in  previous  solutions  these  operations  may  take  time  logarithmic  in  the  number  of  flows 
traversing the router.

1.1 (a) A reference stateful network whose functionality is approximated by (b) a Stateless Core
(SCORE) network. In SCORE only edge nodes maintain per flow state and perform per flow
management; core nodes do not maintain any per flow state.

1.2 An illustration of the Dynamic Packet State (DPS) technique used to implement per flow
services in a SCORE network: (a-b) upon a packet arrival the ingress node inserts some flow
dependent state into the packet header; (b-c) a core node processes the packet based on this
state, and eventually updates both its internal state and the packet state before forwarding it;
(c-d) the egress node removes the state from the packet header.

2.1 The architecture of a router that provides per flow quality of service (QoS). Input interfaces
use routing lookup or packet classification to select the appropriate output interface
for each incoming packet, while output interfaces implement packet classification, buffer
management, and packet scheduling. In today's best effort routers, neither input nor output
interfaces implement packet classification.

3.1 Example illustrating the CSFQ algorithm at a core router. An output link with a capacity
of 10 is shared by three flows with arrival rates of 8, 6, and 2, respectively. The fair rate of
the output link in this case is α = 4. Each arriving packet carries in its header the rate of the
flow it belongs to. According to Eq. (3.3) the dropping probability for flow 1 is 0.5, while
for flow 2 it is 0.33. Dropped packets are indicated by crosses. Before forwarding a packet,
its header is updated to reflect the change in the flow's rate due to packet dropping (see the
packets at the right-hand side of the router).

3.2 Example illustrating the estimation of the aggregate reservation. Two flows with reservations
of 5 and 2, respectively, share a common link. Ingress routers initialize the header of
each packet according to Eq. (3.4). The aggregate reservation is estimated as the ratio between
the sum of the values carried in the packets' headers during an averaging time interval
and the length T of that interval. In this case the estimated reservation is 65/10 = 6.5.

3.3 Example illustrating the route pinning algorithm. Each packet contains in its header the
path's label, defined as the XOR over the identifiers of all routers on the remaining path to
the egress. Upon packet arrival, the packet's header is updated to the label of the remaining
path. The routing decisions are exclusively based on the packet’s label (here the labels are
assumed to be unique).


3.4 The topology used in the experiment reported in Section 3.2. Flow 1 is CBR, has an arrival
rate of 1 Mbps, and a reservation of 1 Mbps. Flow 2 is ON-OFF; it sends 3 Mbps during
ON periods and doesn't send anything during OFF periods. The flow has a reservation of 3
Mbps. Flow 3 is best-effort and has an arrival rate of 8 Mbps. The link between aruba and
cozumel is configured to 10 Mbps.

3.5 A screen-shot of our monitoring tool that displays the real-time measurement results for the
experiment shown in Figure 3.4. The top two plots show the arrival rate of each flow at
aruba and cozumel; the bottom two plots show the delay experienced by each flow at
the two routers.

4.1 (a) A reference stateful network that provides fair bandwidth allocation; each node implements
the Fair Queueing algorithm. (b) A SCORE network that approximates the service
provided by the reference network: each node implements our algorithm, called
Core-Stateless Fair Queueing (CSFQ).

4.2 The architecture of the output port of an edge router, and a core router, respectively.

4.3 The pseudocode of CSFQ.

4.4 The pseudocode of CSFQ (fair rate estimation).

4.5 (a) A 10 Mbps link shared by N flows. (b) The average throughput over 10 sec when N =
32, and all flows are CBRs. The arrival rate for flow i is (i + 1) times larger than its fair

Chapter 1


Introduction 


Today’s  Internet  provides  one  simple  service:  best  effort  datagram  delivery.  Such  a  minimalist 
service  allows  routers  to  be  stateless,  that  is,  except  for  the  routing  state,  which  is  highly  aggregated, 
routers  do  not  need  to  maintain  any  fine  grained  state  about  traffic.  As  a  consequence,  today’s 
Internet  is  both  highly  scalable  and  robust.  It  is  scalable  because  router  complexity  does  not  increase 
in  either  the  number  of  flows  or  the  number  of  nodes  in  the  network,  and  it  is  robust  because  there 
is  little  state,  if  any,  to  update  when  a  router  fails  or  recovers.  The  scalability  and  robustness  are  two 
of  the  most  important  reasons  behind  the  success  of  today’s  Internet. 

However,  as  the  Internet  evolves  into  a  global  commercial  infrastructure,  there  is  a  growing  need 
to provide more powerful services than best effort such as guaranteed services, differentiated services,
and flow protection. Guaranteed services would make it possible to guarantee performance
parameters  such  as  bandwidth  and  delay  on  a  per  flow  basis.  An  example  would  be  to  guarantee 
that  a  flow  receives  at  least  a  specified  amount  of  bandwidth,  ensuring  that  the  delay  experienced 
by  its  packets  does  not  exceed  a  specified  threshold.  This  service  would  provide  support  for  new 
applications such as IP telephony, video-conferencing, and remote diagnostics. Differentiated services
would allow us to provide bandwidth and loss rate differentiation for traffic aggregates over
multiple  granularities  ranging  from  individual  flows  to  the  entire  traffic  of  a  large  organization.  An 
example would be to allocate to one organization twice as much bandwidth on every link in the network
as to another organization. Flow protection would allow diverse end-to-end congestion control
schemes  to  seamlessly  coexist  in  the  Internet,  protecting  the  well  behaved  traffic  from  the  malicious 
or  ill-behaved  traffic.  For  example,  if  two  flows  share  the  same  link,  with  flow  protection,  each  flow 
will  get  at  least  half  of  the  link  capacity  independent  of  the  behavior  of  the  other  flow,  as  long  as 
the flow has enough demand. In contrast, in today’s Internet, a malicious flow that sends traffic at
a higher rate than the link capacity can cause packet losses for another flow no matter how little
traffic that flow sends!
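
The flow protection example above can be made concrete with a small sketch of a max-min fair allocation on a single link. This is an illustration only, not a mechanism taken from this dissertation: the function name `max_min_fair` and the progressive-filling strategy are assumptions chosen for clarity. A flow that demands less than its equal share keeps its full demand, and the leftover capacity is divided among the remaining flows.

```python
def max_min_fair(capacity, demands):
    """Return a max-min fair allocation of `capacity` among `demands`.

    Hypothetical helper for illustration: flows are served in order of
    increasing demand, and each unsatisfied flow gets an equal share of
    whatever capacity is still unallocated.
    """
    allocation = [0.0] * len(demands)
    remaining = float(capacity)
    # Indices of flows not yet allocated, smallest demand first.
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        share = remaining / len(active)
        i = active[0]
        if demands[i] <= share:
            # This flow needs less than its share: satisfy it fully.
            allocation[i] = float(demands[i])
            remaining -= demands[i]
            active.pop(0)
        else:
            # Every remaining flow wants at least the share: split evenly.
            for j in active:
                allocation[j] = share
            break
    return allocation
```

With a link of capacity 10 and demands of 100 and 3, the second flow keeps its 3 units and the aggressive flow is limited to the remaining 7. With two flows that both demand 100, each receives 5, that is, half of the link capacity regardless of the behavior of the other flow.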

Providing  these  services  in  packet  switched  networks  such  as  the  Internet  has  been  one  of  the 
major challenges in network research over the past decade. To address this challenge, a plethora
of  techniques  and  mechanisms  have  been  developed  for  packet  scheduling,  buffer  management, 
and  signaling.  While  the  proposed  solutions  are  able  to  provide  very  powerful  network  services, 
they  come  at  a  cost:  complexity.  In  particular,  these  solutions  usually  assume  a  stateful  network 
architecture,  that  is,  a  network  in  which  every  router  maintains  per  flow  state.  Since  there  can  be  a 
large  number  of  active  flows  in  the  Internet,  and  this  number  is  expected  to  continue  to  increase  at  an 
exponential rate, it is an open question whether such an architecture can be efficiently implemented.
In addition, due to the complex algorithms required to set and preserve state consistency across
the network, robustness is much harder to achieve.

Figure 1.1: (a) A reference stateful network whose functionality is approximated by (b) a Stateless Core
(SCORE) network. In SCORE only edge nodes maintain per flow state and perform per flow management;
core nodes do not maintain any per flow state.

In  summary,  while  stateful  architectures  can  provide  more  sophisticated  services  than  the  best 
effort service, stateless architectures such as the current Internet are more scalable and robust. The
natural question is then: Can we achieve the best of the two worlds? That is, is it possible to
provide services as powerful and flexible as the ones implemented by a stateful network in a stateless
network? 

In  this  dissertation  we  answer  this  question  affirmatively  by  showing  that  some  of  the  most 
representative  Internet  services  that  require  stateful  networks  can  indeed  be  implemented  in  a  mostly 
stateless  network  architecture. 


1.1  Main  Contribution 

The main contribution of this dissertation is to provide the first solution that bridges the long-standing
gap between stateless and stateful network architectures. In particular, we show that three
of the most important Internet services proposed in the literature during the past decade, and for which
the previously known solutions require stateful networks, can be implemented in a stateless core
network.  These  services  are:  (1)  guaranteed  services,  (2)  service  differentiation  for  large  granularity 
traffic,  and  (3)  flow  protection  to  provide  network  support  for  congestion  control. 

The  main  goal  of  our  solution  is  to  push  the  state  and  therefore  the  complexity  out  of  the  network 
core, without compromising the network’s ability to provide per flow services. The key ideas that allow
us  to  achieve  this  goal  are: 

1.  instead  of  having  core  nodes  maintain  per  flow  state,  have  packets  carry  this  state,  and 

2.  use  the  state  carried  by  the  packets  to  implement  distributed  algorithms  to  provide  network 
services as powerful and as flexible as the ones implemented by stateful networks.

Figure 1.2: An illustration of the Dynamic Packet State (DPS) technique used to implement per flow services
in a SCORE network: (a-b) upon a packet arrival the ingress node inserts some flow dependent state into the
packet header; (b-c) a core node processes the packet based on this state, and eventually updates both its
internal state and the packet state before forwarding it; (c-d) the egress node removes the state from the
packet header.

The  following  paragraphs  present  the  main  components  of  our  solution: 

The Stateless Core (SCORE) Network Architecture. The basic building block of our solution is
the Stateless Core (SCORE) domain. We define a SCORE domain as being a trusted and contiguous
region of the network in which only edge routers maintain per flow state; the core routers do not maintain
any per flow state (see Figure 1.1(b)). Since edge routers usually run at a much lower speed and
handle far fewer flows than core routers, this architecture is highly scalable.

The “State-Elimination” Approach. Our ultimate goal is to provide powerful and flexible network
services in a stateless network architecture. To achieve this goal, we propose an approach, called the
“state-elimination” approach, that consists of two steps (see Figure 1.1). The first step is to define a
reference  stateful  network  that  implements  the  desired  service.  The  second  step  is  to  approximate 
or,  if  possible,  to  emulate  the  functionality  of  the  reference  network  in  a  SCORE  network.  By  doing 
this,  we  can  provide  services  as  powerful  and  flexible  as  the  ones  implemented  by  a  stateful  network 
in  a  mostly  stateless  network  architecture,  i.e.,  in  a  SCORE  network. 

The Dynamic Packet State (DPS) Technique. To implement the approach, we propose a novel
technique called Dynamic Packet State (DPS). As shown in Figure 1.2, with DPS, each packet
carries in its header some state that is initialized by the ingress router. Core routers process each
incoming packet based on the state carried in the packet’s header, updating both their internal state and
the state in the packet’s header before forwarding it to the next hop. In this way, routers are able to
process packets on a per flow basis, despite the fact that they do not maintain per flow state. As we
will demonstrate in this dissertation, by using DPS to coordinate the actions of edge and core routers
along the path traversed by a flow, it is possible to design distributed algorithms to approximate the
behavior of a broad class of stateful networks using networks in which core routers do not maintain
per flow state.
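
The division of labor described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation presented in later chapters: the class names, the `dps` header dictionary, and the choice of a flow rate and a hop counter as the carried state are all hypothetical. The point it shows is structural: only the ingress router holds a per flow table, while the core router keeps nothing but aggregate counters and reads everything flow specific from the packet itself.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    flow_id: str
    payload: bytes
    # Hypothetical stand-in for the header space that carries DPS state.
    dps: dict = field(default_factory=dict)


class IngressRouter:
    """Edge router: the only place where per flow state is kept."""

    def __init__(self, flow_rates):
        self.flow_rates = dict(flow_rates)  # per flow state (edge only)

    def process(self, pkt):
        # Insert flow dependent state into the packet header.
        pkt.dps["rate"] = self.flow_rates[pkt.flow_id]
        return pkt


class CoreRouter:
    """Core router: aggregate state only, no table indexed by flow."""

    def __init__(self):
        self.label_sum = 0.0   # aggregate over all packets seen
        self.packets_seen = 0

    def process(self, pkt):
        # Use the carried state, then update internal and packet state.
        self.label_sum += pkt.dps["rate"]
        self.packets_seen += 1
        pkt.dps["hops"] = pkt.dps.get("hops", 0) + 1
        return pkt


class EgressRouter:
    """Edge router on the way out: removes the state from the header."""

    def process(self, pkt):
        pkt.dps.clear()
        return pkt
```

A packet passed through `IngressRouter.process`, one or more `CoreRouter.process` calls, and `EgressRouter.process` leaves the domain with an empty `dps` field, mirroring steps (a) through (d) of Figure 1.2.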

The “Verify-and-Protect” Approach. While our solutions based on SCORE/DPS have many
advantages over traditional stateful solutions, they still suffer from robustness and scalability limitations
when compared to stateless solutions. The scalability of the SCORE architecture suffers
from the fact that the network core cannot transcend trust boundaries (such as boundaries between
competing Internet Service Providers), and therefore high-speed routers on these boundaries must
be stateful edge routers. System robustness is limited by the possibility that a single edge or core
router may malfunction, inserting erroneous information in the packet headers, thus severely impacting
the performance of the entire network.

In Chapter 7 we propose an approach, called “verify-and-protect”, that overcomes these limitations.
We achieve scalability by pushing the complexity all the way to the end-hosts, eliminating
the  distinction  between  edge  and  core  routers.  To  address  the  trust  and  robustness  issues,  all  routers 
statistically  verify  that  the  incoming  packets  are  correctly  marked.  This  approach  enables  routers  to 
discover  and  isolate  misbehaving  end-hosts  and  routers. 


1.2  Other  Contributions 

To  achieve  the  goal  of  providing  the  same  level  of  services  in  a  SCORE  network  as  in  traditional 
stateful  networks,  we  propose  several  novel  distributed  algorithms  that  use  DPS  to  coordinate  the 
actions  between  the  edge  and  core  nodes.  Among  these  algorithms  are: 

Core-Stateless Fair Queueing (CSFQ). This is the first algorithm to approximate, in a core
stateless network, the bandwidth allocation achieved by a stateful network in which all routers
implement Fair Queueing [31, 79]. As discussed in Chapter 4, CSFQ allows us to provide
per flow protection in a SCORE network.

Core Jitter Virtual Clock (CJVC). This is the first algorithm to provide the same worst-case
bandwidth and delay guarantees as Jitter Virtual Clock [126] and Weighted Fair Queueing
[31, 79] in a network architecture in which core routers maintain no per flow state. CJVC
implements the full functionality on the data path to provide guaranteed services in a SCORE
network (see Chapter 5).

Distributed admission control. We propose a distributed per flow admission control protocol
in which core routers need to maintain only aggregate reservation state. To maintain this
state, we develop a robust algorithm based on DPS that provides the same or even stronger
semantics than those provided by previously proposed stateful solutions such as the ATM
User-to-Network Interface (UNI) signaling protocol and the Resource Reservation Protocol (RSVP) [1, 128].
Admission control is a key component of providing guaranteed services. It allows us to reserve
bandwidth and buffer space at each router along a flow’s path to make sure that the flow’s bandwidth
and delay requirements are met.

Route pinning. We propose a light-weight protocol and mechanisms to bind a flow to a
specific route (path) through a network domain, without requiring core routers to maintain
per flow state. This can be viewed as an alternative to Multi-Protocol Label Switching
(MPLS) [17]. Our solutions for guaranteed and differentiated services use route pinning to
make sure that all packets of a flow traverse the same path (see Chapters 5 and 6).
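
As an illustration of the route pinning idea, the sketch below follows the description in the caption of Figure 3.3: a packet carries a path label defined as the XOR over router identifiers, each router updates the label on arrival, and forwarding depends on the label alone. The details are assumptions made for this example and are not taken from Chapters 5 and 6: in particular, the label on arrival is assumed to include the current router's identifier, the table layout and the names `Router`, `install_path`, and `trace` are hypothetical, and labels are assumed to be unique. Note that the table is indexed by path label, not by flow.

```python
from functools import reduce
from operator import xor


def path_label(router_ids):
    """XOR of the identifiers of all routers in `router_ids`."""
    return reduce(xor, router_ids, 0)


class Router:
    def __init__(self, rid):
        self.rid = rid
        # Maps the label of the remaining path to the next hop.
        self.next_hop = {}

    def install(self, remaining_ids):
        """Record the next hop for the path that continues past this router."""
        if remaining_ids:
            self.next_hop[path_label(remaining_ids)] = remaining_ids[0]

    def forward(self, label):
        """Update the label on arrival and pick the next hop from it."""
        remaining = label ^ self.rid  # removes this router from the label
        return remaining, self.next_hop.get(remaining)


def install_path(routers, path):
    """Pin `path` (a list of router ids) in every router along it."""
    for i, rid in enumerate(path):
        routers[rid].install(path[i + 1:])


def trace(routers, first_hop, label):
    """Follow a packet with the given label and return the routers visited."""
    visited, current = [], first_hop
    while current is not None:
        visited.append(current)
        label, current = routers[current].forward(label)
    return visited
```

After `install_path(routers, path)`, a packet that enters at the first router with label `path_label(path)` visits exactly the routers in `path`; at the last router the remaining label is 0 and no further hop is found.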

A  major  challenge  in  implementing  the  DPS-based  algorithms  is  to  find  extra  space  in  the  packet 
header to encode the per flow state. Since this extra space is at a premium, especially in the context of
IPv4,  we  need  to  encode  the  state  as  efficiently  as  possible.  To  address  this  problem,  we  introduce 
two  general  methods  to  achieve  efficient  state  encoding. 

In  the  first  method,  the  idea  is  to  leverage  knowledge  about  the  state  semantics.  In  particular,  to 
save  space  we  can  use  this  knowledge  to  store  a  value  as  a  function  of  another  value.  For  example, 
if  a  value  is  known  to  be  always  greater  than  another  value,  we  can  use  an  accurate  floating  point 
representation  to  represent  the  larger  value,  and  store  the  smaller  value  as  a  fraction  of  the  larger 
one. 
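
The first method can be sketched as follows. This is a minimal example under assumptions that are not part of the text: the 3-byte layout, the use of IEEE half precision (the `e` format of Python's `struct` module) for the larger value, and the 8-bit fraction are hypothetical choices, and the names `encode_pair` and `decode_pair` are illustrative. Because the smaller value is known never to exceed the larger one, it is stored as a fraction in [0, 1] rather than as a second floating point number.

```python
import struct
from typing import Tuple

# Hypothetical layout: 2 bytes for the larger value (half precision float)
# followed by 1 byte holding the smaller value as a fraction of the larger.
_LAYOUT = ">eB"
_SCALE = 255  # largest value that fits in the 1-byte fraction field


def encode_pair(large: float, small: float) -> bytes:
    """Pack two values, relying on the invariant 0 <= small <= large.

    Half precision cannot represent values above 65504; struct.pack
    raises OverflowError in that case.
    """
    if not 0 <= small <= large:
        raise ValueError("expected 0 <= small <= large")
    fraction = round(small / large * _SCALE) if large > 0 else 0
    return struct.pack(_LAYOUT, large, fraction)


def decode_pair(blob: bytes) -> Tuple[float, float]:
    """Recover the larger value and an approximation of the smaller one."""
    large, fraction = struct.unpack(_LAYOUT, blob)
    return large, large * fraction / _SCALE
```

Under these assumptions the pair occupies 3 bytes instead of the 8 bytes needed for two single precision floats, at the cost of a quantization error on the smaller value of at most `large / 510`.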

The  idea  behind  the  second  method  is  to  have  different  packets  of  a  flow  carry  different  state 
formats.  This  method  is  appropriate  for  algorithms  that  do  not  require  all  packets  to  carry  the  same 
type of state. For example, an algorithm may use the same field in the packet header to insert either
data  or  control  path  information,  as  long  as  this  will  not  compromise  the  service  semantics. 

1.3  Evaluation 

In  order  to  evaluate  the  solutions  proposed  in  this  dissertation,  we  try  to  answer  the  following  three 
questions: 

1.  How  scalable  are  the  algorithms  implemented  by  core  routers?  Scalability  represents  the 
ability  of  the  network  to  grow  in  the  number  of  flows  (users),  the  number  of  nodes,  and  the 
traffic  volume.  To  answer  this  question,  we  express  the  complexity  of  the  proposed  algorithms 
as  a  function  of  these  parameters.  In  particular,  we  will  show  that  our  DPS  based  algorithms 
implemented  by  core  routers  are  highly  scalable  as  their  complexity  does  not  depend  on  either 
the  number  of  flows  or  the  network  size. 

2.  How  close  is  the  service  provided  by  our  solution  to  the  service  provided  by  the  reference 
stateful  network?  A  service  is  usually  defined  in  terms  of  performance  parameters  such  as 
bandwidth, delay and loss rate. We answer this question by comparing the performance parameters
achieved under our solution and the reference stateful solution. For example, in the
case  of  the  guaranteed  services,  we  will  show  that  end-to-end  delay  bounds  of  a  flow  in  our 
core  stateless  solution  are  identical  to  the  end-to-end  delay  bounds  of  the  same  flow  in  the 
reference  stateful  solution  (see  Section  5.3.3). 

3.  How does the service provided by our solution compare to similar services provided by existing
stateless solutions? Again, we answer this question by comparing the performance
parameters  of  services  provided  by  our  solution  and  the  stateless  solutions.  However,  unlike 
the previous question where the goal is to see how well we emulate the target service implemented
by a reference stateful network, in this case, our goal is to see how much we gain in
terms of service quality in comparison to existing stateless solutions. For example, in the case
of flow protection, we will show that none of the traditional solutions that exhibit the same
complexity at core routers is effective in providing flow protection (see Section 4.4).

To  address  the  above  three  questions,  we  use  a  mix  of  theoretical  analysis,  simulations,  and 
experimental  results.  In  particular,  to  answer  the  first  question,  we  use  theoretical  analysis  to  derive 
the  time  and  space  complexity  of  the  algorithms  performed  by  both  edge  and  core  routers.  To  answer 
the last two questions we derive worst-case or asymptotic bounds for the performance parameters
that  characterize  the  service,  such  as  delay  and  bandwidth.  Whenever  we  cannot  obtain  such  bounds, 
or  if  we  want  to  relax  the  assumptions  to  fit  more  realistic  scenarios,  we  rely  on  extensive  simulations 
by  using  an  accurate  packet  level  simulator  such  as  ns-2  [78]. 

For  illustration,  consider  our  solution  to  provide  per  flow  protection  in  a  SCORE  network  (see 
Chapter  4).  To  answer  the  scalability  question  we  show  that  in  our  solution  a  core  router  does  not 
need  to  maintain  any  per  flow  state,  and  that  the  time  it  takes  to  process  a  packet  is  independent  of 
the number of flows that traverse the router, n. In contrast, with the existing solutions, each router
needs to maintain state for every flow, and the time it takes to process a packet increases with log n.
Consequently, our solution exhibits an O(1) space and time complexity, as compared to existing
solutions that exhibit an O(n) space complexity, and an O(log n) time complexity. To answer
the  second  question  we  use  theoretical  analysis  to  show  that  the  difference  between  the  average 
bandwidth  allocated  to  a  flow  in  a  SCORE  network  and  the  bandwidth  allocated  to  the  same  flow 
in  the  reference  network  is  bounded.  In  addition,  to  answer  the  third  question  and  to  study  more 
realistic  scenarios,  we  use  extensive  simulations. 
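
To illustrate why per packet processing at a core router can take constant time, the sketch below works through the numbers given in the caption of Figure 3.1: a link of capacity 10, flows with arrival rates 8, 6, and 2, and a fair rate of 4. Eq. (3.3) is not reproduced in this chapter, so the drop probability used here, max(0, 1 - fair_rate / flow_rate), is an assumption that is consistent with the probabilities 0.5 and 0.33 quoted in that caption. The function names are hypothetical, and the estimation of the fair rate itself is left out.

```python
import random


def drop_probability(flow_rate, fair_rate):
    """Assumed form of the drop probability: max(0, 1 - fair/flow).

    `flow_rate` is the positive rate read from the packet header, so no
    per flow lookup is needed and the computation takes constant time.
    """
    return max(0.0, 1.0 - fair_rate / flow_rate)


def process_packet(label_rate, fair_rate, rng=random.random):
    """Drop or forward one packet using only its header and the fair rate.

    Returns None if the packet is dropped; otherwise returns the rate to
    write back into the header, which reflects the flow's rate after
    dropping (at most the fair rate).
    """
    if rng() < drop_probability(label_rate, fair_rate):
        return None
    return min(label_rate, fair_rate)
```

With a fair rate of 4, the flows with rates 8, 6, and 2 are forwarded at expected rates of 4, 4, and 2, which add up to the link capacity of 10.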

Finally,  to  demonstrate  the  viability  of  our  solutions  and  to  explore  the  compatibility  of  the  DPS 
technique  with  IPv4,  we  present  a  detailed  implementation  in  FreeBSD,  as  well  as  experimental 
results,  to  evaluate  accuracy  and  implementation  overhead. 


1.4  Discussion 

In  this  dissertation,  we  make  two  central  assumptions.  The  first  is  that  the  ability  to  process  packets 
on  a  per  flow  basis  is  beneficial,  and  perhaps  even  crucial,  for  supporting  new  emerging  applications 
in  the  Internet.  The  second  is  that  it  is  very  hard,  if  not  impossible,  for  traditional  stateful  solutions 
to support these services in high-speed backbone routers. It is important to note that these two assumptions
do not necessarily imply that it is infeasible to support these emerging services in high-speed
networks. They just illustrate the drawback of existing solutions that require routers to maintain
and manage per flow state. In this dissertation we eliminate this problem, by demonstrating that
it is possible to process packets on a per flow basis without requiring high-speed routers to maintain
any per flow state.

The  next  two  sections  motivate  these  assumptions. 

1.4.1  Why  Per  Flow  Processing? 

The ability to process packets on a per flow basis is important because it would allow us simultaneously
(1) to support applications with different performance requirements, and (2) to achieve
high resource utilization. To illustrate this point consider a simple example in which a file transfer
application and an audio application share the same link. On one hand, we want the file transfer
application to be able to use the entire link capacity, when the audio source does not send any traffic.
On the other hand, when the audio application starts the transmission, we want this application to be
able immediately to reclaim its share of the link capacity. In addition, since the audio application is
much more sensitive to packet delay than the file transfer application, we should be able to preferentially
treat the audio traffic in order to minimize its delay. As demonstrated by previous proposals,
such a functionality can be easily provided in a stateful network in which routers process packets on
a per flow basis [10, 48, 106].

A  natural  question  to  ask  is  whether  performing  packet  processing  at  a  coarser  granularity,  that 
is, on a per class basis, wouldn’t allow us to achieve similar results. With such an approach, applications
with similar performance requirements would be aggregated in the same traffic class. This
would  make  routers  much  simpler  to  implement,  as  they  need  to  differentiate  between  potentially 
only  a  small  number  of  classes,  rather  than  a  large  number  of  flows.  While  this  approach  can  go  a 
long  way  to  support  new  applications  in  a  scalable  fashion,  it  has  fundamental  limitations.  The  main 
problem  is  that  this  approach  implicitly  assumes  that  all  applications  in  the  same  class  (1)  cooperate, 
and  (2)  have  similar  requirements  at  each  router.  If  assumption  (1)  does  not  hold,  then  malicious 
users  may  arbitrarily  degrade  the  service  of  other  users  in  the  same  class.  If  assumption  (2)  does 
not  hold,  it  is  very  hard  to  meet  all  application  requirements  and  simultaneously  achieve  efficient 
resource  utilization.  Unfortunately,  these  assumptions  do  not  necessarily  hold  in  practice.  As  we 
discuss in Chapter 4, cooperation is hard to achieve in today’s Internet: even in the absence of malicious
users, there is a natural incentive for a user to aggressively send more and more traffic in the
hope  of  making  other  users  quit  and  grabbing  their  resources.  Assumption  (2)  may  not  hold  simply 
because  applications  care  about  the  end-to-end  performance,  and  not  about  the  local  performance 
they  experience  at  a  particular  router.  As  a  result,  applications  with  similar  end-to-end  performance 
requirements  may  end  up  having  very  different  performance  requirements  at  individual  routers.  For 
example,  consider  two  flows  that  carry  voice  traffic  and  belong  to  the  same  class,  one  traversing 
a  15  node  path,  and  another  traversing  a  three  node  path.  In  addition,  assume  that,  as  suggested 
by  recent  studies  in  the  area  of  interactive  voice  communication  [7,  64],  the  tolerable  end-to-end 
delay  for  both  flows  is  about  100  ms,  and  that  the  propagation  delay  alone  along  each  path  is  10 
ms.  Then,  while  the  first  flow  can  afford  a  delay  of  only  6  ms  per  router,  the  second  flow  can  afford 
a  delay  of  up  to  30  ms  per  router.  But  if  both  flows  traverse  the  same  router,  the  router  will  have 
to  provide  a  6  ms  delay  to  both  flows,  as  it  does  not  have  any  way  to  differentiate  between  the  two 
flows. Unfortunately, as we show in Appendix B.1, even under very low link utilization (e.g., 15%),
it  is  very  difficult  to  provide  small  delay  bounds  for  all  flows. 
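
The arithmetic behind the voice example can be sketched as follows. This is a minimal illustration that assumes, as the example implicitly does, that the delay left after propagation is split evenly across the routers on the path; the function name is hypothetical.

```python
def per_router_delay_budget_ms(end_to_end_ms, propagation_ms, num_routers):
    """Evenly split the delay left after propagation across the routers on a path."""
    if num_routers <= 0:
        raise ValueError("a path must contain at least one router")
    return (end_to_end_ms - propagation_ms) / num_routers


# Both voice flows tolerate 100 ms end to end, of which 10 ms is propagation delay.
long_path_budget = per_router_delay_budget_ms(100, 10, 15)   # 15 node path
short_path_budget = per_router_delay_budget_ms(100, 10, 3)   # three node path

# A router that cannot tell the two flows apart must meet the tighter budget for both.
shared_router_bound = min(long_path_budget, short_path_budget)
```

With per flow processing, the shared router could instead give each flow its own budget, which is the source of the utilization gain discussed in the text.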

In  summary,  the  ability  to  process  packets  on  a  per  flow  basis  is  highly  desirable  not  only  because 
it  allows  us  to  support  applications  with  diverse  needs,  but  also  because  it  allows  us  to  maximize  the 
resource  utilization  by  closely  matching  the  application  requirements  to  resource  consumption. 

1.4.2  Scalability  Concerns  with  Stateful  Network  Architectures 

In  this  section,  we  argue  that  the  existing  solutions  that  enable  packet  processing  on  a  per  flow 
basis,  that  is,  stateful  solutions,  have  serious  scalability  limitations,  and  that  these  limitations  make 
the  deployment  of  these  solutions  unlikely  in  the  foreseeable  future. 

Recall that by scalability we mean the ability of a network to grow in the number of nodes, in the
number of users it can support, and the traffic volume it can carry. Since in today’s Internet these
parameters  increase  at  an  exponential  rate,  scalability  is  a  fundamental  property  of  any  protocol 
or  algorithm  to  be  deployed  in  the  Internet.  Indeed,  according  to  recent  statistics,  Internet  traffic 
doubles every six months, and it is expected to do so until 2008 [88]. This growth is fueled by
both the exponential increase in the number of hosts, and the increase of bandwidth available to
end users. The estimated number of hosts* reached 72 million in February 2000, and it is expected
to  reach  1  billion  by  2008  [89].  In  addition,  the  replacement  of  the  ubiquitous  56  Kbps  modems 
with  cable  modems  and  Digital  Subscriber  Line  (DSL)  connections  will  increase  home  users’  access 
bandwidth  by  at  least  one  order  of  magnitude. 

In spite of such rapid growth, a question still remains: with the continuous increase in available
processor speed and memory capacity, wouldn’t it be feasible to implement stateful solutions at very
high speeds? In the remainder of this section, we answer this question. In particular, we first discuss
why it is hard to implement per flow solutions today, and then we argue that it will be even harder
to implement them in the foreseeable future.

Very  high-end  routers  today  can  switch  on  the  order  of  terabits  per  second,  and  handle  individual 
links  of  up  to  20  Gbps  [2].  With  an  average  packet  size  of  500  bytes,  an  input  has  only  25  ns 
to process a packet. If we assume a 1 GHz processor that is capable of executing an instruction
every clock cycle, we have just 25 instructions available per packet. During this time a router
has  to  read  the  packet  header,  classify  the  packet  to  the  flow  it  belongs  to  based  on  the  fields  in 
the  packet  header,  and  then  process  the  packet  based  on  the  state  associated  to  the  flow.  Packet 
processing  may  include  rate  regulation,  and  packet  scheduling  based  on  some  arbitrary  parameter 
such as the packet deadline. In addition, stateful solutions require the setup of per flow state, and
the maintenance of the consistency of this state at all routers on the flow’s path. Maintaining state
consistency in a distributed network environment such as the Internet, in which packets can be lost
or arbitrarily delayed and routers can fail, is a very difficult problem [4, 117]. Primarily due to these
technical  difficulties,  none  of  the  high-end  routers  today  implement  stateful  solutions. 

While  throwing  more  and  more  transistors  at  the  problem  will  help,  this  will  not  necessarily 
solve the problem. Even if, as Moore’s law predicts, processor performance continues to double
every 18 months, this increase may not be able to offset the faster increase of the Internet traffic volume,
which doubles every six months. Worse yet, the increase in the router capacity not only reduces
the  time  available  to  process  a  packet,  but  can  also  increase  the  amount  of  work  the  router  has  to 
do  per  packet.  This  is  because  a  higher  speed  router  will  handle  more  flows,  and  the  complexity 
of some of the per packet operations, such as packet classification and scheduling, depends on
the  number  of  flows.  Even  factoring  out  the  algorithmic  complexity,  maintaining  per  flow  state  has 
the  disadvantage  of  requiring  a  large  memory  footprint,  which  will  negatively  impact  the  memory 
access times. Finally, the advances in semiconductor performance will do little to address the challenge
of maintaining the per flow state consistency, arguably the most difficult problem faced by
today’s  proposals  to  provide  per  flow  services. 
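
The gap between the two growth rates quoted above can be made concrete with a short calculation. This is only a sketch: it assumes pure exponential growth with fixed doubling periods (six months for traffic, 18 months for processor performance), and the function names are hypothetical.

```python
def growth_factor(years, doubling_period_months):
    """Multiplicative growth after `years`, given a fixed doubling period."""
    return 2.0 ** (years * 12.0 / doubling_period_months)


def traffic_to_processor_gap(years, traffic_doubling_months=6.0,
                             processor_doubling_months=18.0):
    """How much faster traffic grows than processor performance over `years`."""
    return (growth_factor(years, traffic_doubling_months)
            / growth_factor(years, processor_doubling_months))


# Over three years traffic doubles six times, processors only twice.
gap_after_three_years = traffic_to_processor_gap(3)
```

Under these assumptions the gap itself doubles roughly every nine months, so the per packet processing budget shrinks even as processors improve.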

* This number represents only hosts with Domain Names. The actual number of computers that are connected to the
Internet is much larger, but this number is much more difficult to estimate.

1.5 Organization


The  rest  of  this  dissertation  is  organized  as  follows:  Chapter  2  provides  background  information. 
In  the  first  part,  it  presents  the  IP  network  model  which  is  the  foundation  of  today’s  Internet.  In 
the  second  part,  it  discusses  two  of  the  most  prominent  proposals  to  provide  better  service  in  the 
Internet: Integrated Services and Differentiated Services. The chapter emphasizes the trade-offs
between providing services with stronger semantics and implementation complexity.

Chapter  3  describes  the  main  components  of  our  solution,  and  gives  three  simple  examples  to 
illustrate  the  DPS  technique.  The  solution  is  then  compared  in  terms  of  scalability  and  robustness 
against  traditional  solutions  aiming  to  provide  similar  services  in  the  Internet. 

Chapters  4,  5,  and  6  describe  three  important  network  services  that  can  be  implemented  by 
our  solution:  (1)  flow  protection  to  provide  network  support  for  congestion  control,  (2)  guaranteed 
services,  and  (3)  service  differentiation  for  large  traffic  aggregates,  respectively.  Our  solution  is 
the  first  to  implement  flow  protection  for  congestion  control  and  guaranteed  services  in  a  stateless 
core  network  architecture.  We  use  simulations  or  experimental  results  to  evaluate  our  solutions  and 
compare  them  to  existing  solutions  that  provide  similar  services. 

Chapter 7 describes a novel approach called “verify-and-protect” to overcome some of the scalability
and robustness limitations of our solution. We illustrate this approach in the context of
providing  per  flow  protection,  by  developing  tests  to  accurately  identify  misbehaving  nodes,  and 
present  simulation  results  to  demonstrate  the  effectiveness  of  the  approach. 

Chapter  8  presents  our  prototype  implementation  which  provides  guaranteed  services  and  per 
flow protection. It discusses compatibility issues with the IPv4 protocol, and the information encoding
in the packet header. The latter part of the chapter discusses a lightweight monitoring tool that
is  able  to  continuously  monitor  the  traffic  on  a  per  flow  basis  without  affecting  real-time  guarantees. 

Finally,  Chapter  9  summarizes  the  conclusions  of  the  dissertation,  discusses  the  limitations  of 
our  work,  and  ends  with  directions  for  future  work. 

Chapter 2


Background 


Over the past decade, two classes of solutions have been proposed to provide better network services
than the existing best effort service in the Internet: those maintaining the stateless property
of the original Internet (e.g., Differentiated Services), and those requiring a new stateful architecture
(e.g., Integrated Services). While stateful solutions can provide more powerful and flexible
services  such  as  per  flow  guaranteed  services,  and  can  achieve  higher  resource  utilization,  they  are 
less  scalable  than  stateless  solutions.  On  the  other  hand,  while  stateless  solutions  are  much  more 
scalable,  they  offer  weaker  services.  In  this  chapter,  we  first  present  all  the  mechanisms  that  a  router 
needs  to  implement  in  order  to  support  these  solutions,  and  then  discuss  in  detail  the  implementation 
complexity  of  each  solution  and  the  service  quality  it  achieves. 

The remainder of this chapter is organized as follows. Section 2.1 discusses the two main communication
models proposed in the literature: circuit switching and packet switching. Section 2.2
presents  the  Internet  Protocol  (IP)  network  model,  the  foundation  of  today’s  Internet.  In  particular, 
the section discusses the operations performed by existing and next generation routers on both
the data and control paths. The data path consists of all operations performed by a router on a packet
as the packet is forwarded to its destination, and includes packet forwarding, packet scheduling,
and buffer management. The control path consists of the operations and protocols used to initialize and
maintain  the  state  required  to  implement  the  data  path  functionalities.  Examples  of  control  path 
operations  are  constructing  and  maintaining  the  routing  tables,  and  performing  admission  control. 
Section  2.3  presents  a  taxonomy  of  services  in  a  packet  switching  network.  Based  on  this  taxonomy, 
we  discuss  some  of  the  most  prominent  services  proposed  in  the  context  of  the  Internet:  the  best 
effort service, flow protection, Integrated Services, and Differentiated Services. We then compare
these  solutions  in  terms  of  the  quality  of  service  they  provide  and  their  complexity.  Section  2.4 
concludes  this  chapter  by  summarizing  our  findings. 


2.1  Circuit  Switching  vs.  Packet  Switching 

Communication  networks  can  be  classified  into  two  broad  categories:  packet  switching  and  circuit 
switching.  Circuit  switching  networks  are  best  represented  by  telephone  networks,  first  developed 
more  than  100  years  ago.  In  these  networks,  when  two  end  points  need  to  communicate,  a  dedicated 
channel  (circuit)  is  set  up  between  them.  The  channel  remains  open  for  the  entire  duration  of  the 
call,  no  matter  whether  the  channel  is  actually  used  or  not. 

Packet switching networks are best exemplified by the Asynchronous Transfer Mode (ATM)
and Internet Protocol (IP) networks. In these networks information is carried by packets. Each
packet is switched and transmitted through the network based on the information contained in the
packet header. At the destination, the packets are reassembled to reconstruct the original information.

The  most  important  advantage  of  packet  switching  over  circuit  switching  is  the  ability  to  exploit 
statistical multiplexing. Unlike circuit switching where no one can use an open channel if its endpoints
do not use it, with packet switching, active sources can use any excess capacity made available
by  the  inactive  sources.  In  a  networking  environment  with  bursty  traffic,  allowing  sources  to  share 
network  resources  can  significantly  increase  network  utilization.  Indeed,  a  recent  study  shows  that 
the  ratio  between  the  peak  and  the  average  rate  is  3:1  for  audio  traffic,  and  as  high  as  15:1  for  data 
traffic  [88]. 

The  main  drawback  of  packet  switching  networks  is  that  statistical  multiplexing  can  lead  to 
congestion. Network congestion happens when the arrival rate temporarily exceeds the link capacity.
In  such  a  case,  the  network  has  to  decide  which  traffic  to  drop,  and  which  to  transmit.  In  addition, 
end  hosts  are  either  expected  to  implement  some  form  of  congestion  control,  that  is,  to  reduce  their 
sending  rates  when  they  detect  congestion  in  the  network,  or  to  avoid  congestion  by  making  sure 
that  they  do  not  send  more  traffic  than  the  available  capacity  of  the  network. 

Due  to  its  superior  flexibility  and  resource  usage,  the  majority  of  today's  networks  are  based 
on packet switching technologies. The most prominent packet switching architectures are Asynchronous
Transfer Mode (ATM) [3, 12], and Internet Protocol (IP) [22]. ATM uses fixed size packets called
cells  as  the  basic  transmission  unit,  and  was  designed  from  the  ground  up  to  provide  sophisticated 
services  such  as  bandwidth  and  delay  guarantees.  In  contrast,  IP  uses  variable  size  packets,  and 
supports  only  one  basic  service:  best  effort  packet  delivery,  which  does  not  provide  any  timeliness 
or  reliability  guarantees.  Despite  the  advantages  of  ATM  in  terms  of  quality  of  service,  during  the 
last decade IP has emerged as the dominant architecture. For several technical and political reasons
that try to explain this outcome, see Tanenbaum [109].

As a result, our emphasis in this dissertation is on IP networks. While our SCORE/DPS techniques
are applicable to packet switching networks in general, in this dissertation we examine them
exclusively in the context of IP. In the remainder of this chapter, we first present the Internet Protocol
(IP) network model, which is the foundation of today's Internet, and then we consider some of
the  major  proposals  to  provide  better  services  in  the  Internet,  and  discuss  their  trade-offs. 


2.2  IP  Network  Model 

The  main  service  provided  by  today’s  IP  network  is  to  deliver  packets  between  any  two  nodes  in  the 
network  with  a  “reasonable”  probability  of  success.  The  key  component  that  enables  this  service 
is  the  router.  Each  router  has  two  or  more  interfaces  that  attach  it  to  multiple  networks.  Routers 
forward  each  packet  based  on  the  destination  address  in  the  packet’s  header.  For  this  purpose,  each 
router  maintains  a  table,  called  routing  table,  that  maps  every  IP  address  to  an  interface  attached 
to  the  router.  Routing  tables  are  constructed  and  maintained  by  the  routing  protocol.  The  routing 
protocol  is  implemented  by  a  distributed  algorithm  whose  main  function  is  to  let  routers  learn  the 
reachability of any host in the Internet along a “good” path. In general, the term “good” applies
to the shortest* path to a node. Thus, ideally, a packet travels along the shortest path from source to
destination. 

2.2.1  Router  Architecture 

As  noted  in  the  previous  section,  a  router  consists  of  a  set  of  input  interfaces  at  which  packets 
arrive, and a set of output interfaces, from which packets depart. The input and output interfaces are
interconnected by a high speed fabric that allows packets to be transferred from inputs to outputs.
The  main  parameter  that  characterizes  the  fabric  is  the  speedup.  The  speedup  is  defined  as  the  ratio 
between  (a)  the  maximum  transfer  rate  across  the  fabric  from  an  input  to  an  output  interface,  and 
(b)  the  capacity  of  an  input  (output)  link. 

As  a  packet  traverses  a  router,  the  packet  can  be  stored  at  input,  at  output,  or  at  both  the  input 
and  output  interfaces.  Based  on  where  a  router  can  store  packets,  routers  are  classified  as  input 
queueing,  output  queueing,  or  input-output  queueing. 

In  an  output-queueing  router,  when  a  packet  arrives  at  the  input,  it  is  immediately  transferred 
to  the  corresponding  output.  Since  packets  are  enqueued  and  scheduled  only  at  the  outputs,  this 
architecture  is  easy  to  analyze  and  understand.  For  this  reason,  most  analytical  studies  assume  an 
output-queueing  router  model. 

On  the  downside,  the  output-queueing  router  architecture  requires  a  speedup  as  high  as  n,  where 
n  is  the  number  of  inputs.  The  worst  case  scenario  occurs  when  all  inputs  simultaneously  receive 
packets  for  the  same  output.  Since  inputs  are  bufferless,  the  output  has  to  be  able  to  simultaneously 
receive the n packets, hence the speedup of n. As the number of inputs in a modern router is quite
large  (e.g.,  it  can  exceed  32),  building  high-speed  output-queueing  routers  is,  in  general,  infeasible. 
That  is  why  practically  all  of  today’s  routers  employ  some  sort  of  input-output  queueing.  By  being 
able  to  buffer  packets  at  the  inputs,  the  speedup  of  the  interconnection  fabric  can  be  significantly 
reduced. However, this comes at a cost: complexity. Since only the output has complete knowledge
of  how  packets  are  scheduled,  complex  distributed  algorithms  to  control  the  packet  transfer  from 
inputs  to  outputs  have  to  be  implemented.  Furthermore,  this  complexity  makes  the  router  behavior 
much  more  difficult  to  analyze. 

In summary, while output-queueing routers are more tractable for analysis, the input and input-output
queueing routers are more scalable and therefore easier to build. Fortunately, recent work has
shown  that  a  large  class  of  algorithms  implemented  by  an  output  queueing  router  can  be  emulated 
by  an  input-output  queueing  router  which  has  an  internal  speedup  of  only  2  [21,  102].  Thus,  at 
least  in  principle,  it  is  possible  to  build  scalable  input-output  queueing  routers  that  can  emulate  the 
behavior  of  output  queueing  routers.  For  this  reason,  in  the  remainder  of  this  dissertation,  we  will 
assume  an  output  queueing  router  architecture. 

Next,  we  discuss  in  more  detail  the  output-queueing  router  architecture.  Specifically,  we  present 
all  the  operations  that  a  router  needs  to  perform  on  the  data  and  control  paths  in  order  to  implement 
currently  proposed  solutions  that  aim  to  provide  better  services  than  the  best  effort  service,  such  as 
Integrated  Services  and  Differentiated  Services. 

* The most common metric used in today’s Internet is the number of routers (hops) on the path.

Figure 2.1: The architecture of a router that provides per flow quality of service (QoS). Input interfaces use
routing lookup or packet classification to select the appropriate output interface for each incoming packet,
while output interfaces implement packet classification, buffer management, and packet scheduling. In today's
best effort routers, neither input nor output interfaces implement packet classification.

2.2.2 Data Path

Data  path  represents  the  set  of  operations  performed  by  routers  on  a  data  packet  as  the  packet 
travels  from  source  to  destination.  The  main  functions  performed  by  routers  on  the  data  path  are: 
(1)  routing  lookup,  (2)  buffer  management,  and  (3)  packet  scheduling.  Routing  lookup  identifies 
the output interface to which each incoming packet is forwarded, based on the destination address in the
packet’s header. Buffer management and scheduling are concerned with managing router resources
in case of congestion. In particular, when the buffer overflows, or when its occupancy exceeds some predefined
threshold, the router has to decide which packet to drop. Similarly, when there is more than one
packet in the buffer, the router has to decide which packet to transmit next. Usually, today’s routers
implement  a  simple  drop-tail  buffer  management  scheme,  that  is,  when  the  buffer  overflows,  the 
packet  at  the  tail  of  the  queue  is  dropped.  Packets  are  scheduled  on  a  First-In-First-Out  (FIFO) 
basis. 
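
The default behavior described above can be sketched in a few lines. This is an illustrative model rather than router code: the class name and the packet-count capacity are assumptions, and "dropping the packet at the tail" is modeled as discarding the arriving packet, which would otherwise become the new tail.

```python
from collections import deque


class DropTailQueue:
    """Single shared FIFO buffer with drop-tail buffer management."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum number of buffered packets (assumed unit)
        self._buffer = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Buffer the packet, or drop it if the buffer is already full."""
        if len(self._buffer) >= self.capacity:
            self.dropped += 1      # the arriving packet is discarded
            return False
        self._buffer.append(packet)
        return True

    def dequeue(self):
        """Transmit the oldest buffered packet (FIFO); None if the buffer is empty."""
        return self._buffer.popleft() if self._buffer else None
```

Both operations take constant time and need no per flow state, which is why this combination is so cheap to implement at line speed.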

However, currently proposed solutions to provide more sophisticated services than the best effort
service, such as per flow bandwidth and delay guarantees, require routers to perform a fourth
function:  (4)  packet  classification.  Packet  classification  consists  of  mapping  each  incoming  packet 
to  the  flow  it  belongs  to.  We  use  the  term  flow  to  denote  a  subset  of  packets  that  travel  from  one 
node  to  another  node  in  the  network.  Since  both  routing  lookup  and  packet  classification  can  be 
used to determine to which output interface a packet is forwarded, in the remainder of this section
we refer to both these operations as packet forwarding operations. Figure 2.1 depicts the relationship
between the four functions in an output-queueing router that performs per flow management.
In  the  remainder  of  this  section,  we  present  these  functions  in  more  detail.  Since  currently  proposed 
solutions  to  provide  per  flow  service  semantics  require  routers  to  maintain  and  manage  per  flow 
state,  we  will  elaborate  on  the  complexity  of  these  routers. 

2.2.2.1 Packet Forwarding: Routing Lookup and Packet Classification

Packet forwarding is the main and the most complex function performed by today’s routers on the
data  path.  This  function  consists  of  forwarding  each  incoming  packet  to  the  corresponding  output 
interface  based  on  the  fields  in  the  packet  header.  Virtually  all  routers  in  today’s  Internet  forward 
packets  based  on  their  destination  addresses.  The  process  of  finding  the  appropriate  output  port 
based  on  the  packet  destination  address  is  called  routing  lookup.  However,  to  implement  more 
sophisticated functionalities such as providing better services to selected customers or preventing
some categories of traffic from entering the network, routers may need to use additional fields in the packet
headers to distinguish between different traffic classes. Examples of such fields are the source
address to identify the incoming traffic of a selected customer, and the destination port number to
identify the traffic of different applications. The process of finding the class to which the packet
belongs is called packet classification. Note that routing lookup is a particular case of packet
classification  in  which  packets  are  classified  based  on  one  field:  the  destination  address.  In  the 
remainder  of  this  section  we  discuss  in  more  detail  routing  lookup  and  packet  classification. 

Routing  Lookup  With  routing  lookup,  each  router  maintains  a  table,  called  routing  table,  that  maps 
each  IP  address  to  an  output  interface.  At  the  minimum,  each  entry  in  the  routing  table  consists  of 
two  fields.  The  first  field  contains  an  address  prefix,  and  the  second  field  contains  the  identifier  of 
an  output  interface.  The  address  prefix  specifies  the  range  of  all  IP  addresses  that  share  the  same 
prefix.  Upon  a  packet  arrival,  the  router  searches  its  routing  table  for  the  longest  prefix  that  matches 
the  packet’s  destination  address,  and  then  forwards  the  packet  to  the  output  interface  specified  by 
the  second  field  in  the  same  entry.  Thus,  routing  lookup  consists  of  a  search  operation  that  retrieves 
the  longest  prefix  match. 
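
As an illustration of the longest prefix match semantics, the following sketch scans a small routing table linearly using Python's standard `ipaddress` module. The table contents and interface names are made up for the example, and a linear scan is used only for clarity; it is not how a router would implement the search.

```python
import ipaddress


def longest_prefix_match(routing_table, destination):
    """Return the interface of the longest prefix that contains `destination`.

    `routing_table` is a list of (prefix, interface) pairs, e.g. ("10.0.0.0/8", "if1").
    Returns None when no prefix matches.
    """
    addr = ipaddress.ip_address(destination)
    best_len, best_interface = -1, None
    for prefix, interface in routing_table:
        network = ipaddress.ip_network(prefix)
        if addr in network and network.prefixlen > best_len:
            best_len, best_interface = network.prefixlen, interface
    return best_interface


# Hypothetical table: a default route plus two nested prefixes.
table = [("0.0.0.0/0", "if0"), ("10.0.0.0/8", "if1"), ("10.1.0.0/16", "if2")]
```

Here a packet for 10.1.2.3 matches all three entries and is forwarded on `if2`, the entry with the longest matching prefix.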

To minimize the size of the routing table, IP addresses are assigned in blocks based on their prefixes
[41]. As a result, the size of the largest routing tables today is about 70,000 entries [122], which
is  three  orders  of  magnitude  smaller  than  the  total  number  of  hosts,  which  is  about  72  million  [89]. 

Traditional  algorithms  to  implement  the  routing  lookup  are  based  on  Patricia  tries  [72].  In  the 
simplest  form,  Patricia  tries  are  binary  trees  in  which  each  node  represents  a  binary  string  that 
encodes  the  path  from  the  tree’s  root  to  that  node.  As  an  example,  consider  such  a  tree  in  which  all 
left  branches  are  labeled  by  0,  and  all  right  branches  are  labeled  by  1.  Then,  string  010  corresponds 
to  the  node  that  can  be  reached  by  walking  down  the  tree  from  the  root,  first  along  the  left  branch, 
then  along  the  right  branch,  and  finally  along  the  left  branch.  In  the  case  of  Patricia  tries  used  to 
implement  routing  lookup,  each  leaf  node  represents  an  address  prefix.  Since  the  height  of  the  tree 
is bounded by the address size s, the worst case time complexity of the lookup operation is O(s).
However, recent developments have significantly reduced this complexity. In particular, Waldvogel
et al. [115] propose a routing lookup algorithm that scales with the logarithm of the address size,
while Degermark et al. [30] propose a routing lookup algorithm tuned for IPv4 that takes less than
100 instructions on an Alpha processor, and uses only up to eight memory references. Furthermore,
by using a hardware implementation, Gupta et al. [50] propose a pipelined architecture that can
perform  a  routing  lookup  every  memory  cycle.  However,  these  improvements  do  not  come  for  free. 
The  complexity  of  updating  the  routing  table  in  these  algorithms  is  much  higher  than  in  the  case 
of  the  algorithms  based  on  Patricia  tries.  Nevertheless,  this  tradeoff  is  justified  by  the  fact  that,  in 
practice, updates are much less frequent than lookups.
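
The tree walk described above can be sketched with a plain binary trie. Note the simplifications, all of which are assumptions of this sketch: it omits the path compression that distinguishes a Patricia trie, it represents prefixes and addresses as strings of '0' and '1' characters, and it lets any node (not only a leaf) carry a prefix.

```python
class TrieNode:
    __slots__ = ("children", "interface")

    def __init__(self):
        self.children = [None, None]   # index 0: left branch, index 1: right branch
        self.interface = None          # set only if this node ends a stored prefix


class BinaryTrie:
    """Uncompressed binary trie supporting longest prefix match."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, interface):
        node = self.root
        for bit in prefix_bits:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.interface = interface

    def lookup(self, address_bits):
        """Walk down the trie, remembering the last prefix seen on the way."""
        node = self.root
        best = node.interface
        for bit in address_bits:
            node = node.children[int(bit)]
            if node is None:
                break
            if node.interface is not None:
                best = node.interface
        return best
```

The lookup visits at most one node per address bit, which is the O(s) bound quoted above; the faster schemes cited in the text trade this simplicity for more expensive updates.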

In  summary,  today  it  is  possible  to  perform  a  routing  lookup  at  the  line  speed,  that  is,  without 
slowing down a router that otherwise performs only packet queuing and dequeuing. Furthermore,
this is expected to remain true in the foreseeable future. Even if the Internet continues to expand at
its current rate, due to address aggregation, the routing tables are likely to remain relatively small.
Assuming that the current ratio between the number of hosts and the size of the routing tables will
not  change,  and  that,  as  predicted,  the  number  of  hosts  will  reach  one  billion  by  2008  [88],  we 
expect  that  routing  table  size  will  increase  by  a  factor  of  about  16  over  the  next  eight  years.  While 
this  increase  might  seem  considerable,  it  should  be  more  than  compensated  for  by  the  increase  in 
computer processing power and memory capacity. Indeed, according to Moore's law, during the
same time span semiconductor performance is expected to improve by a factor of about 40.

Packet Classification Currently proposed solutions to provide Quality of Service (QoS), such as
bandwidth and delay guarantees, require routers to maintain and manage per flow state. That is,
upon a packet arrival, the router has to classify it to the class or flow the packet belongs to. A
class  is  usually  defined  by  a  filter.  A  filter  consists  of  a  set  of  partially  specified  fields  that  define  a 
region  in  the  packet  space.  Common  fields  used  for  packet  classification  are  source  and  destination 
IP addresses, source and destination port numbers, and the protocol type. An example of a filter is
(src_addr = 123.16.120.x, dst_addr = 234.16.120.x, dst_port = x, src_port = 1000-1200,
proto_type = x), where x stands for “don’t care”. This filter represents the entire traffic going
from subnet 123.16.120.x to subnet 234.16.120.x with the source port in the range
1000-1200. As an example, the packet identified by (src_addr = 123.16.120.12, dst_addr =
234.16.120.2, dst_port = 21, src_port = 1080, proto_type = TCP) belongs to this class, while
a packet sent by a host with the IP address 15.14.51.12 does not. It is worth noting that routing
is just a particular case of packet classification, in which each filter is specified by only one field:
dst_addr.
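
The filter semantics in this example can be sketched as follows. The representation is an assumption made for illustration: the "x" wildcard in the last octet of an address is modeled as a /24 prefix, a wildcard port as the full range 0-65535, and a wildcard protocol as `None`; the class and function names are hypothetical. The filter is taken to constrain the source port, as in the tuple notation of the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple

ANY_PORT = (0, 65535)  # inclusive range standing in for the "x" wildcard


@dataclass(frozen=True)
class Filter:
    src_net: str
    dst_net: str
    src_ports: Tuple[int, int] = ANY_PORT
    dst_ports: Tuple[int, int] = ANY_PORT
    proto: Optional[str] = None  # None means "don't care"


def matches(flt, pkt):
    """Return True if the packet header (a dict) falls inside the filter's region."""
    return (
        ipaddress.ip_address(pkt["src_addr"]) in ipaddress.ip_network(flt.src_net)
        and ipaddress.ip_address(pkt["dst_addr"]) in ipaddress.ip_network(flt.dst_net)
        and flt.src_ports[0] <= pkt["src_port"] <= flt.src_ports[1]
        and flt.dst_ports[0] <= pkt["dst_port"] <= flt.dst_ports[1]
        and (flt.proto is None or flt.proto == pkt["proto"])
    )


example_filter = Filter("123.16.120.0/24", "234.16.120.0/24", src_ports=(1000, 1200))
in_class = {"src_addr": "123.16.120.12", "dst_addr": "234.16.120.2",
            "src_port": 1080, "dst_port": 21, "proto": "TCP"}
out_of_class = dict(in_class, src_addr="15.14.51.12")
```

Checking one packet against one filter is cheap; the difficulty discussed next comes from finding the matching filter among many, possibly overlapping, ones at line speed.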

It  should  come  as  no  surprise  that  the  classification  problem  is  inherently  difficult.  Current 
solutions  [51,  66,  96,  97]  work  well  only  for  a  relatively  small  number  of  classes,  i.e.,  no  more  than 
several  thousand.  This  is  because,  as  noted  by  Gupta  and  McKeown  [51],  the  packet  classification 
problem is similar to the point location problem in the domain of computational geometry. Given
a  point  in  an  F  dimensional  space,  this  problem  asks  to  find  the  enclosing  region  among  a  set  of 
regions.  In  the  case  of  non-overlapping  regions,  the  best  bounds  for  n  regions  in  an  F  dimensional 
space are O(log n) in time and O(n^F) in space, or, alternatively, O(log^{F-1} n) in time and O(n) in
space.  This  suggests  a  clear  trade-off  between  space  and  time  complexities.  It  also  suggests  that  it  is 
very  hard  to  simultaneously  achieve  both  speed  and  efficient  memory  usage.  Worse  yet,  the  packet 
classification  problem  is  even  more  difficult  than  the  traditional  point  location  problem  as  it  allows 
class  (region)  overlapping. 

2.2.2.2  Buffer  Management 

IP  routers  are  based  on  a  store-and-forward  architecture,  i.e.,  when  a  packet  arrives  at  a  router,  the 
packet  is  first  stored  in  a  buffer,  and  then  forwarded.  Since,  in  practice,  buffers  are  finite,  the  routers 
have  to  cope  with  the  possibility  of  packet  loss.  Even  with  infinite  buffer  capacity,  there  might  be 
the need to drop packets, as some congestion control schemes, such as TCP, rely on packet loss to
detect  network  congestion. 

Any  buffer  management  scheme  has  to  answer  two  questions:  (1)  when  is  a  packet  dropped?, 
and  (2)  which  packet  is  dropped?  In  addition,  per  flow  buffer  management  schemes  have  to  answer 
a  third  question:  (3)  which  queue  to  drop  from?  Examples  of  policies  that  answer  the  first  question 
are:  drop  a  packet  when  the  buffer  overflows  (e.g.,  drop-tail),  or  when  the  average  occupancy  of  the 
buffer exceeds some threshold. Examples of policies that answer the second question are: drop the
last packet in the queue, the first packet in the queue, or a random packet. Finally, an example of
a policy that answers the last question is to drop a packet from the longest queue.

While  simple  network  services  can  be  implemented  by  using  a  single  queue  which  is  shared 
by  all  flows,  solutions  that  provide  more  powerful  services  such  as  per  flow  bandwidth  and  delay 
guarantees  require  routers  to  maintain  and  manage  a  separate  queue  for  each  flow.  In  this  case, 
the  most  expensive  operation  is  usually  to  answer  question  (3),  that  is,  to  choose  the  queue  to  drop 
from.  As  an  example,  an  algorithm  that  implements  a  policy  that  drops  the  packet  from  the  longest 
queue has O(log n) complexity, where n is the number of non-empty queues. However, in practice,
this  complexity  can  be  significantly  reduced  by  grouping  the  queues  that  have  the  same  size,  or  by 
approximating  the  algorithm  [108]. 
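
As a minimal sketch of how the three questions fit together, the following Python class drops the last packet of the longest queue when a shared buffer overflows. The class name, the packet-count capacity, and the choice to drop from the tail are assumptions made for illustration. For clarity the longest queue is found with a linear scan; a priority queue over queue lengths would be needed to reach the O(log n) bound quoted above.

```python
from collections import deque


class LongestQueueDropBuffer:
    """Shared buffer with one FIFO queue per flow (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total number of packets the buffer can hold
        self.queues = {}           # flow_id -> deque of packets
        self.size = 0

    def enqueue(self, flow_id, packet):
        """Store the packet; return the packet dropped to make room, or None."""
        self.queues.setdefault(flow_id, deque()).append(packet)
        self.size += 1
        if self.size <= self.capacity:      # question (1): drop only on overflow
            return None
        # Question (3): choose the longest queue (linear scan, for clarity only).
        victim = max(self.queues, key=lambda f: len(self.queues[f]))
        # Question (2): drop the last packet of that queue.
        dropped = self.queues[victim].pop()
        self.size -= 1
        if not self.queues[victim]:
            del self.queues[victim]
        return dropped
```

Note that the arriving packet is not necessarily the one dropped: a flow with a short queue is admitted at the expense of the flow that currently occupies the most buffer space.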

2.2.2.3  Packet  Scheduling 

The  job  of  the  packet  scheduler  is  to  decide  what  packet  to  transmit,  if  any,  when  the  output  link 
becomes  idle.  In  routers  that  maintain  per  flow  state  this  is  accomplished  in  two  steps:  (1)  select  a 
flow  that  has  a  packet  to  send,  and  (2)  transmit  a  packet  from  the  flow’s  queue. 

Packet  scheduling  disciplines  are  classified  into  two  broad  categories:  work  conserving  and 
non-work  conserving.  In  a  work  conserving  discipline,  the  output  link  is  busy  as  long  as  there  is 
at  least  one  packet  in  the  system  destined  for  that  output.  In  contrast,  in  a  non-work  conserving 
discipline  it  is  possible  for  an  output  link  to  be  idle,  despite  the  fact  that  there  are  packets  in  the 
system  destined  for  that  output.  Virtually  all  routers  in  today’s  Internet  are  work-conserving,  and 
implement  a  simple  FIFO  scheduling  discipline.  However,  solutions  to  support  better  services  than 
best  effort,  such  as  bandwidth  and  delay  guarantees,  require  more  sophisticated  packet  scheduling 
schemes.  Examples  of  such  schemes  that  are  work  conserving  are:  Static  Priority  [123],  Weighted 
Round  Robin  [52],  Virtual  Clock  [127],  Weighted  Fair  Queueing  [31],  and  Delay  Earliest  Deadline 
Due [124]. Similarly, examples of non-work conserving disciplines are: Stop-and-Go [44],
Jitter-Virtual Clock [126], Hierarchical Round Robin [63], and Rate Controlled Static Priority [123].

Many of the simpler disciplines such as FIFO, Static Priority, and Weighted Round Robin can
be easily implemented by constant time algorithms, i.e., algorithms that take O(1) time to process
each  packet.  In  contrast,  the  more  sophisticated  scheduling  disciplines  such  as  Virtual  Clock  and 
Weighted  Fair  Queueing  are  significantly  more  complex  to  implement.  In  general,  the  algorithms 
to  implement  these  disciplines  associate  with  each  flow  a  unique  parameter  that  is  used  to  select 
the  flow  to  be  served.  Examples  of  such  a  parameter  are  the  flow’s  priority,  and  the  deadline  of 
the  packet  at  the  head  of  the  queue.  Flow  selection  is  usually  implemented  by  selecting  the  flow 
with  the  largest  or  the  smallest  value.  This  can  be  accomplished  by  maintaining  a  priority  queue 
data structure in which the time complexity of selecting a flow is O(log n), where n represents the
number of flows in the queue.
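
The two-step selection described above can be sketched with Python's `heapq` module. The sketch assumes that the per flow parameter is the deadline of the packet at the head of the flow's queue and that the smallest value is served first; the class and method names are hypothetical and do not correspond to any particular discipline cited here.

```python
import heapq
from collections import deque


class DeadlineScheduler:
    """Per flow queues plus a priority queue over backlogged flows (sketch)."""

    def __init__(self):
        self.queues = {}   # flow_id -> deque of (deadline, packet)
        self.heap = []     # one (head deadline, flow_id) entry per backlogged flow

    def enqueue(self, flow_id, deadline, packet):
        q = self.queues.setdefault(flow_id, deque())
        q.append((deadline, packet))
        if len(q) == 1:                      # the flow just became backlogged
            heapq.heappush(self.heap, (deadline, flow_id))

    def dequeue(self):
        """Called when the output link becomes idle."""
        if not self.heap:
            return None
        _, flow_id = heapq.heappop(self.heap)    # step (1): select a flow, O(log n)
        q = self.queues[flow_id]
        _, packet = q.popleft()                  # step (2): send its head packet
        if q:                                    # re-insert with the new head deadline
            heapq.heappush(self.heap, (q[0][0], flow_id))
        return flow_id, packet
```

Each heap operation costs O(log n) in the number of backlogged flows, which is the cost referred to in the paragraph above.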

Non-work-conserving disciplines, as well as some of the more complex work-conserving
disciplines, may employ a second parameter. The purpose of the second parameter is to determine
whether the flow with a non-empty queue is allowed to send or not. An example of such a
parameter is the eligible time. The packet at the head of the queue can be transmitted only if its eligible
time is smaller than or equal to the system time. Obviously, the addition of a second parameter increases
the  implementation  complexity.  In  many  cases,  the  implementation  is  divided  into  two  parts:  a 
rate  controller  that  stores  packets  until  they  become  eligible,  and  a  scheduler  that  selects  the  flow’s 
packet  to  be  transmitted  based  on  the  first  parameter  (e.g.,  deadline).  Since  the  rate  controller  is 
usually  implemented  by  constant  time  algorithms  [10],  the  overall  complexity  of  selecting  a  packet 
is  generally  dominated  by  the  scheduling  algorithm. 

Once  a  flow  is  selected,  one  of  its  packets  is  transmitted  -  usually  the  packet  at  the  head  of  the 
queue  -  and  the  parameter(s)  associated  with  the  flow  are  eventually  updated. 
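
The split into a rate controller and a scheduler can be illustrated as follows. This is a simplified sketch under several assumptions: packets rather than flows are the scheduling unit, each packet arrives already tagged with a hypothetical eligible time and deadline, and the rate controller is a binary heap rather than the constant time structure mentioned above.

```python
import heapq
import itertools


class RateControlledScheduler:
    """Rate controller feeding a deadline-ordered scheduler (sketch)."""

    def __init__(self):
        self._seq = itertools.count()   # tie-breaker so packets are never compared
        self.regulator = []             # heap of (eligible_time, seq, deadline, packet)
        self.ready = []                 # heap of (deadline, seq, packet)

    def enqueue(self, eligible_time, deadline, packet):
        heapq.heappush(self.regulator, (eligible_time, next(self._seq), deadline, packet))

    def dequeue(self, now):
        # Rate controller: release every packet whose eligible time has passed.
        while self.regulator and self.regulator[0][0] <= now:
            _, seq, deadline, packet = heapq.heappop(self.regulator)
            heapq.heappush(self.ready, (deadline, seq, packet))
        if not self.ready:
            return None                 # the link idles even if packets are held back
        # Scheduler: among eligible packets, send the one with the earliest deadline.
        return heapq.heappop(self.ready)[2]
```

Returning `None` while the regulator still holds packets is what makes the discipline non-work conserving.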

2.2.3  Control  Path 

The  control  path  consists  of  all  functions  and  operations  performed  by  the  network  to  set  up  and 
maintain the state required by the data path. These functions are implemented by routing and
signaling protocols.

2.2.3.1  Routing  Protocol 

The purpose of routing protocols is to set up and maintain routing tables of all routers in a network.
Routing  protocols  are  implemented  by  distributed  algorithms  that  try  to  learn  the  reachability  of  any 
host  in  the  network.  In  the  Internet,  routing  protocols  are  organized  in  a  two  level  hierarchy. 

At the higher level, the Internet consists of a large number of interconnected autonomous
systems (ASs). An AS represents a distinct routing domain, which is usually administered by a single
organization such as a company or university. ASs are connected via gateways, which use
inter-domain routing protocols to exchange routing information about which hosts are reachable by each
AS.  As  a  result,  each  gateway  constructs  a  routing  table  that  maps  each  IP  address  to  a  neighbor  AS 
that  knows  a  path  to  that  IP  address.  The  most  common  inter-domain  routing  protocol  in  use  today 
is  Border  Gateway  Protocol  (BGP)  [86]. 

At the lower level within an AS, routers communicate with each other using an intra-domain
routing  protocol.  The  purpose  of  these  protocols  is  to  enable  routers  to  exchange  locally  obtained 
information  so  that  all  routers  within  an  AS  have  coherent  and  up  to  date  information  needed  to 
reach  any  host  within  the  AS.  Examples  of  intra-domain  routing  protocols  are  Routing  Information 
Protocol  (RIP)  [54],  and  Open  Shortest  Path  First  (OSPF)  [73]. 

The division of routing protocols into intra- and inter-domain is crucial for the scalability of the
Internet. On one hand, this allows the deployment of sophisticated intra-domain routing protocols which can
gather an accurate picture of the host reachability within an AS. On the other hand, the inter-domain
routing protocols present much coarser information about host reachability. Unlike intra-domain
routing  protocols  that  specify  the  path  at  the  router  granularity,  these  protocols  specify  the  path  at 
the  AS  granularity.  This  tradeoff  gives  an  organization  maximum  flexibility  in  managing  its  own 
resources,  without  compromising  routing  scalability  at  the  level  of  the  entire  Internet.  Some  key 
factors affecting routing scalability, as well as some basic principles of designing scalable routing
protocols are presented by Yu [122].

In  summary,  as  proved  by  the  Internet’s  own  existence,  the  hierarchical  routing  architecture  is 
both scalable and robust. However, it should be noted that one of the main reasons behind these
desirable properties is the weak semantic of the best effort service. The best effort service does not
provide any reliability or timeliness guarantees. As long as a “reasonable” number of packets reach
their destinations, packet loss and packet reordering are acceptable. As a result, route changes,
route oscillations, or even router failures do not necessarily compromise the service. In contrast, with
stronger  service  semantics  such  as  the  guaranteed  service,  existing  routing  protocols  are  not  good 
enough.  The  next  section  discusses  these  issues  in  more  detail. 

2.2.3.2 Signaling Protocol

To  implement  more  sophisticated  services  such  as  per  flow  delay  and  bandwidth  guarantees,  we 
need  the  ability  to  perform  admission  control  and  route  pinning.  The  task  of  admission  control 
is  to  reserve  enough  resources  on  the  path  from  source  to  destination  in  order  to  meet  the  service 
requirements.  In  turn,  route  pinning  makes  sure  that  all  packets  of  the  flow  traverse  the  path  on 
which resources have been reserved. Traditionally, these two functionalities are implemented by
signaling protocols such as Tenet Real-Time Channel Administration Protocol (RCAP) [8, 34], or
RSVP [128]. The complexity of signaling protocols is primarily due to the difficulty of keeping
the state consistent in a distributed environment. In the remainder of this section we discuss this issue
in  the  context  of  both  admission  control  and  route  pinning. 

Admission  Control  Admission  control  makes  sure  that  there  are  enough  network  resources  on 
the  path  from  source  to  destination  to  meet  the  service  requirements,  such  as  delay  and  bandwidth 
guarantees. To better understand the issues with admission control, consider the following
example. Assume host A requests bandwidth reservation for a flow that has destination B. One possible
method  to  achieve  this  is  to  send  a  control  message  embedding  the  reservation  request  along  the  path 
from  A  to  B.  Upon  receiving  this  message,  each  router  along  the  path  checks  whether  it  has  enough 
resources  to  accept  the  reservation.  If  it  does,  it  allocates  the  required  resources  and  then  forwards 
the message. When host B receives this message, it sends back an acknowledgement to A. The
reservation is considered successful if and only if all routers along the path have accepted it; otherwise
the  reservation  is  rejected.  While  simple,  this  procedure  does  not  account  for  various  failures  such 
as  packet  loss  and  partial  reservation  failures.  Partial  reservation  failures  occur  when  only  a  subset 
of routers along the path accept the reservation. In this case, the protocol has to undo the reservation
at  the  routers  that  have  accepted  it.  To  handle  packet  loss,  when  a  router  receives  a  reservation 
request message, the router has to be able to tell whether it is a duplicate of a message already
processed or not. To handle partial reservation failures, a router needs to remember the decision made
for  the  reservation  request  in  a  previous  pass.  For  these  reasons,  all  existing  solutions  maintain  per 
flow reservation state, be it hard state as in ATM UNI [1] and Tenet Real-Time Channel Administration
Protocol (RCAP) [8, 34], or soft state as in RSVP [128]. However, maintaining consistent and
dynamic state in a distributed environment is in itself a challenging problem. Fundamentally, this is
because admission control assumes a transaction-like semantic, which is very difficult to achieve in
a distributed system in the presence of message losses and arbitrary delays [4, 117].
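
To make the role of per flow reservation state concrete, the following sketch models the hop-by-hop procedure in a single Python process. It is not RSVP, RCAP, or ATM signaling; the `Router` class, its methods, and `reserve_path` are hypothetical names, and message loss is represented only by the fact that a repeated request for a known flow is recognized as a duplicate.

```python
class Router:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.reservations = {}      # per flow state: flow_id -> reserved bandwidth

    def reserve(self, flow_id, rate):
        """Return True if the reservation is (or already was) accepted."""
        if flow_id in self.reservations:
            return True             # duplicate request: do not allocate twice
        if sum(self.reservations.values()) + rate > self.capacity:
            return False            # not enough resources at this router
        self.reservations[flow_id] = rate
        return True

    def release(self, flow_id):
        self.reservations.pop(flow_id, None)


def reserve_path(path, flow_id, rate):
    """Reserve on every router of the path, undoing partial reservations."""
    accepted = []
    for router in path:
        if not router.reserve(flow_id, rate):
            for r in accepted:      # partial failure: roll back upstream routers
                r.release(flow_id)
            return False
        accepted.append(router)
    return True
```

Both the duplicate check and the rollback rely on the `reservations` table, which is exactly the per flow state that the text identifies as the source of complexity. In a real network the rollback itself travels in messages that can be lost or delayed, which this sketch does not model.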

Route  Pinning  Once  a  flow’s  reservation  request  is  accepted,  the  source  can  start  sending  data 
packets.  However,  to  meet  the  performance  requirements  negotiated  during  the  admission  control, 
we have to make sure that all packets of a flow traverse the same path. Otherwise, a packet can
traverse a path that does not have enough resources, which will lead to service violation. The
operation of binding a flow to a path (route) is called route pinning. Whenever the underlying routing
protocol does not support route pinning, this functionality can be provided by the signaling
protocols together with the admission control. For example, in RSVP, when a node accepts a reservation,
it  also  stores  the  next  hop  on  the  flow’s  path  in  its  database.  Since  these  protocols  maintain  per  flow 
state,  augmenting  this  state  to  store  the  next  hop  does  not  increase  their  complexity. 

Alternatively,  route  pinning  can  be  separated  from  admission  control.  One  example  is  ATM  [12] 
whose  routing  protocol  natively  supports  route  pinning.  Another  example  is  Multi-Protocol  Label 
Switching  (MPLS),  recently  proposed  to  perform  traffic  engineering  in  the  Internet  [17].  In  both 
ATM  and  MPLS,  the  main  idea  is  to  perform  routing  based  on  identifiers  that  have  local  meaning, 
instead  of  identifiers  that  have  global  meaning  such  as  IP  addresses.  Each  router  maintains  a  table 
which  maps  each  local  identifier  (label)  to  an  output  interface.  Each  packet  carries  a  label  that 
specifies  how  the  packet  is  to  be  routed  at  the  next  hop.  Before  forwarding  a  packet,  a  router 
replaces  the  existing  label  with  a  new  label  that  is  used  by  the  next  hop  to  route  the  packet.  Note 
that  this  requires  a  router  to  also  store  the  labels  used  by  its  neighbors.  Besides  route  pinning,  one 
other  major  advantage  of  routing  based  on  labels,  instead  of  IP  addresses,  is  performance.  Instead 
of  searching  for  the  longest  prefix  match,  we  only  have  to  search  for  an  exact  match,  which  is  much 
faster  to  implement. 
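
Label-based forwarding can be sketched as a chain of exact-match table lookups. The table layout below, in which each router maps an incoming label to a next hop and an outgoing label, is an assumption made for illustration; the router names and label values are arbitrary, and the sketch does not reproduce the ATM or MPLS encodings.

```python
def forward(tables, ingress, label):
    """Follow a labelled packet hop by hop and return the routers it visits."""
    path = []
    router = ingress
    while router is not None:
        path.append(router)
        # Exact-match lookup on the local label, then swap in the label
        # that the next hop uses for the same path.
        next_hop, label = tables[router][label]
        router = next_hop
    return path


# One pinned path A -> B -> C; (None, None) marks the egress router.
tables = {
    "A": {17: ("B", 42)},
    "B": {42: ("C", 7)},
    "C": {7: (None, None)},
}
```

Because the path is determined entirely by the label installed at the ingress, all packets carrying that label follow the same route, independently of any later change in the IP routing tables.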

On the downside, these routing schemes need a special protocol to distribute labels and keep them
consistent. While in this case routers do not need to maintain per flow state, they still need to
maintain per path state. However, in practice, the number of paths that can traverse a core router can
still be quite large. In the worst case, this number increases with the square of the number of edge
nodes. Thus, in the case of an AS that has hundreds of edge nodes, this number can be on the order
of hundreds of thousands. Finally, label distribution protocols have to address the same challenges as
other distributed algorithms that need to keep state consistent in the presence of link and router
failures,  such  as  Tenet  RCAP  and  RSVP. 
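
As a rough worked example of the worst case, assume (hypothetically) that one path is set up between every ordered pair of E edge nodes and that all of these paths cross the same core router. The figure E = 400 is chosen only for illustration:

```latex
P_{\max} = E(E-1), \qquad E = 400 \;\Rightarrow\; P_{\max} = 400 \cdot 399 = 159{,}600 .
```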

2.2.4  Discussion 

Among  all  the  operations  performed  by  routers  on  the  data  path,  packet  classification  is  arguably 
the most complex. As discussed in Section 2.2.2.1, algorithms to solve this problem require at least
O(log n) time and O(n^F) space, or, alternatively, at least O(log^{F-1} n) time and O(n) space, where
n represents the number of classes, and F represents the number of fields in a filter.

In contrast, most buffer management and packet scheduling algorithms have O(n) space
complexity and O(log n) time complexity. By trading resource utilization for speed, we can further
reduce the time complexity to O(log log n) or even O(1). For example, [98] proposes an
implementation of Weighted Fair Queueing with O(1) time complexity.

The  important  point  to  note  here  is  that  our  DPS  technique  trivially  eliminates  the  most  complex 
operation performed by core routers on the data path: packet classification. This is because, with
DPS, the state required to process packets is carried by the packets themselves, instead of being
maintained by core routers (see Section 1.1). Consequently, core routers do not need to perform any
packet  classification. 

On  the  control  path  the  most  complex  operation  is  arguably  the  admission  control  for  which 
current  solutions  require  routers  to  maintain  per  flow  state.  The  main  difficulty  is  to  maintain  the 
consistency of the distributed state in the presence of packet losses, arbitrary packet delays, and
router  failures. 

Again,  the  main  benefit  of  using  DPS  is  that  by  eliminating  the  need  for  core  routers  to  maintain 
per flow state, we trivially eliminate the need to keep this state consistent.

2.3  Network  Service  Taxonomy 

In  this  section  we  present  a  general  taxonomy  of  services  in  a  packet  switching  network,  and  then 
use  this  taxonomy  to  describe  the  traditional  best  effort  service,  and  the  recently  proposed  services 
to  enhance  today’s  Internet.  We  then  describe  and  compare  the  existing  solutions  to  implement  these 
services. 

The  primary  goal  of  a  network  is  to  provide  services  to  its  end-hosts.  Services  are  usually 
classified  along  two  axes:  (a)  the  granularity  of  the  network  abstraction  to  which  the  service  applies, 
and  (b)  the  “quality”  of  the  service. 

As  the  name  suggests,  packet  switching  networks  are  centered  around  the  packet  abstraction.  A 
packet  represents  the  smallest  piece  of  information  that  can  be  routed  through  the  network.  At  a 
higher level of granularity, we have the concept of a flow. A flow represents a subset of packets that
travel between two nodes in the network. If these nodes are routers, we will also use the terminology
of macro-flow. An example of a flow is the traffic of a TCP connection, while an example of a
macro-flow is the traffic between two sub-networks. At an even higher level of abstraction, we have traffic
aggregates  over  multiple  destinations  or  sources.  Examples  of  traffic  aggregates  are  the  entire  web 
traffic  of  a  user,  or  the  entire  outgoing/incoming  traffic  of  an  organization. 

Along  the  second  axis,  a  service  is  described  by  a  set  of  properties  that  can  be  either  qualitative 
or  quantitative.  Examples  of  qualitative  properties  are  reliability  and  isolation.  Isolation  refers  to 
the  ability  of  the  network  to  protect  the  traffic  of  a  flow  against  malicious  sources  that  may  flood 
the  network.  Quantitative  properties  are  described  in  terms  of  performance  parameters  such  as 
bandwidth,  delay,  delay  jitter  and  loss  probability.  Usually,  these  parameters  are  reported  on  an 
end-to-end  basis.  For  example,  the  delay  represents  the  total  time  it  takes  a  packet  to  travel  from 
source to its destination. Similarly, the delay jitter represents the difference between the
maximum and the minimum end-to-end delays experienced by the packets of a flow. Note that
quantitative and qualitative properties are not necessarily orthogonal. For example, a service
that  guarantees  a  zero  loss  probability  is  trivially  a  reliable  service. 

Quantitative  services  can  be  further  classified  into  absolute  and  relative  services.  Absolute 
services  specify  precise  quantities  that  bound  the  service  performance  parameters  such  as  worst  case 
bandwidth  or  delay.  In  contrast,  relative  services  specify  the  relative  difference  or  ratio  between  the 
performance  parameters.  Examples  of  absolute  services  are:  “flow  A  is  guaranteed  a  bandwidth  of 
2  Mbps”,  and  “the  loss  probability  of  flow  A  is  less  than  10“^”.  Examples  of  relative  services  are: 
“flow A has twice the bandwidth of flow B”, and “flow A has a packet loss twice as small as flow
B”.

Service                      Network Abstraction                Service Description
---------------------------  ---------------------------------  ------------------------------
Best effort                  packet                             connectivity
Flow Protection              flow                               protect well-behaved flows
                                                                against ill-behaved ones
Intserv / Guaranteed         flow                               bandwidth and delay guarantees
Intserv / Controlled-Load    flow                               “weak” bandwidth guarantees
Diffserv / Premium           macro-flow                         bandwidth guarantees
Diffserv / Assured           traffic aggregate over multiple    “weak” bandwidth guarantees
                             destinations/sources

Table 2.1: A taxonomy of services in IP networks.

Next,  we  discuss  some  of  the  most  prominent  services  proposed  in  the  context  of  the  Internet: 
(1)  the  best  effort  service,  (2)  flow  protection  to  provide  network  support  for  congestion  control,  (3) 
Integrated  Services,  and  (4)  Differentiated  Services.  Table  2.1  shows  a  taxonomy  of  these  services. 

2.3.1  Best  Effort  Service 

Today’s Internet provides one simple service: the best effort service. This is fundamentally a
connectivity service which allows any two hosts in the Internet to communicate by exchanging packets.
As the name suggests, this service does not make any promise of whether a packet is actually
delivered to the destination, or whether the packets are delivered in order or not. Such a minimalist
service  requires  little  support  from  routers.  In  general,  routers  just  forward  packets  on  a  First-In 
First-Out  (FIFO)  basis.  Thus,  excepting  the  routing  state,  which  is  highly  aggregated,  a  router  does 
not  need  to  maintain  and  manage  any  fine  grained  state  about  traffic.  This  simple  architecture  has 
several  desirable  properties: 

Scalability Since the only state maintained by routers is the routing state, today’s Internet
architecture is highly scalable. In particular, address aggregation allows routers to maintain
little state as compared to the number of hosts in the network. For example, a typical router
today stores fewer than 70,000 entries [122], which is several orders of magnitude lower than
the number of hosts in the Internet, which is around 72 million [89].

Robustness One of the most important goals in designing the Internet was robustness [22].
In  particular,  the  requirement  was  that  two  end-hosts  should  be  able  to  communicate  despite 
router  and  link  failures,  and/or  network  reconfiguration.  The  only  case  in  which  two  hosts 
can  no  longer  communicate  is  when  the  network  between  the  two  hosts  is  partitioned.  The 
fact  that  the  state  of  a  flow  is  maintained  only  by  end-hosts  and  not  by  routers  makes  it 
significantly  easier  to  ensure  robustness,  as  router  failures  do  not  compromise  the  flow  state. 
Had  the  flow  state  been  kept  by  routers,  complex  algorithms  to  replicate  and  restore  this  state 
would  be  needed  to  handle  failures.  Furthermore,  such  algorithms  would  be  able  to  provide 
protection against failures only if the number of routers failing is smaller than the number of
replicas.

There  is  one  caveat  with  respect  to  the  Internet  robustness  though.  It  can  be  argued  that,  to 
a  large  extent,  today’s  Internet  is  robust  mainly  because  it  provides  a  weak  service  semantic. 
Indeed,  as  long  as  the  “majority”  of  packets  still  reach  their  destination,  router  or  link  failures 
do  not  compromise  the  service.  In  contrast,  it  is  fundamentally  more  difficult  to  achieve 
robustness  in  the  case  of  a  strong  semantic  service  such  as  the  guaranteed  service.  In  this 
case,  a  router  or  link  failure  can  easily  compromise  the  service.  Note  that  even  if  back-up 
paths  were  used  to  restore  the  service,  time  sensitive  parameters  such  as  delay  may  still  be 
affected  during  the  recovery  process. 

Performance  The  simplicity  of  the  router  design  allows  efficient  implementation  at  very  high 
speeds.  Usually,  these  routers  implement  the  FIFO  scheduling  discipline  and  drop-tail  buffer 
management,  which  are  both  constant-time  operations. 

2.3.2  Flow  Protection:  Network  Support  for  Congestion  Control 

Because of their reliance on statistical multiplexing, data networks such as the Internet must
provide mechanisms to control congestion. The current Internet relies on end-to-end congestion control
mechanisms in which senders reduce their transmission rates whenever they detect congestion in the
network. The most widely utilized form of congestion control is the
additive-increase/multiplicative-decrease scheme implemented by TCP [57, 83], a scheme which has proven to be highly
successful in preventing congestion collapse^. However, the viability of this approach depends on one
fundamental  assumption:  all  end-hosts  cooperate  by  implementing  equivalent  congestion  control 
algorithms. 

While  this  was  a  reasonable  assumption  when  the  Internet  was  primarily  used  by  the  research 
community, and the vast majority of traffic was TCP based, this is no longer true today. The
emergence of new multimedia applications, such as IP telephony, audio and video streaming, which use
more aggressive UDP based protocols, negatively affects the still predominant TCP traffic.
Although there are considerable ongoing efforts to develop protocols for the new applications that are
TCP friendly [6, 84, 85] - protocols that implement TCP like congestion control algorithms - these
efforts fail to address the fundamental problem: in an economic environment cooperation is not
always optimal. In particular, in case of congestion, the natural incentive of a sender is to send more
traffic in the hope that it will force other senders to back off, and as a result it will be able to use the
extra bandwidth. This incentive translates into a positive feedback behavior, i.e., the more packets
that are dropped in the network, the more packets the user sends, which can ultimately lead to
congestion collapse. It is interesting to note that this problem resembles the “tragedy of the commons”
problem, well known in the economics literature [53].

Two approaches were proposed to address this problem: (1) flow identification and (2) fair
bandwidth allocation. Both of these approaches require changes in the routers. In the following
sections,  we  discuss  these  approaches  in  more  detail. 

^Congestion  collapse  occurs  when  sources  increase  their  sending  rates  when  they  experience  losses,  in  the  hope  that 
more  of  their  packets  will  get  through.  Eventually,  this  will  lead  to  a  further  increase  in  the  packet  loss,  and  result  in 
consistent  buffer  overflow  at  the  congested  routers. 
2.3.2.1 Identification Approach

The  main  idea  of  this  approach,  advocated  by  Floyd  and  Fall  [36],  is  to  identify  and  then  punish  the 
flows  that  are  ill-behaved.  In  short,  routers  employ  a  set  of  tests  to  identify  ill-behaved  flows.  When 
a  flow  is  identified  as  being  ill-behaved,  it  is  punished  by  preferentially  having  its  packets  dropped 
until  its  allocated  bandwidth  becomes  smaller  than  the  bandwidth  allocated  to  a  well-behaved  flow. 
In  this  way,  the  punishment  creates  the  incentive  for  end-hosts  to  send  well-behaved  traffic.  The 
obvious  question  is  how  to  identify  an  ill-behaved  flow. 

To answer this question, Floyd and Fall [36] propose a suite of tests, which try to detect whether
a flow is TCP friendly or not, i.e., whether the behavior of a flow is consistent with the behavior of a
TCP flow under similar conditions. In particular, these tests estimate the round-trip time (RTT) and
the packet dropping probability, and then check whether the throughput of a flow and its dynamics
are consistent with those of a TCP flow having the same RTT and experiencing the same packet
dropping  probability. 
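
A throughput check of this kind can be sketched as follows. The bound used here, T <= 1.5 * sqrt(2/3) * B / (R * sqrt(p)), is the commonly quoted steady-state TCP throughput approximation for packet size B, round-trip time R, and drop probability p. Whether it matches every detail of the tests in [36] is an assumption, and the function names are hypothetical.

```python
import math


def tcp_friendly_limit(packet_size, rtt, drop_prob):
    """Approximate upper bound (bytes/s) on the rate of a conformant TCP flow.

    packet_size is in bytes, rtt in seconds, drop_prob in (0, 1].
    """
    return 1.5 * math.sqrt(2.0 / 3.0) * packet_size / (rtt * math.sqrt(drop_prob))


def is_tcp_friendly(arrival_rate, packet_size, rtt, drop_prob):
    """True if the measured arrival rate (bytes/s) does not exceed the bound."""
    return arrival_rate <= tcp_friendly_limit(packet_size, rtt, drop_prob)
```

For 1500-byte packets, an RTT of 100 ms, and a drop probability of 1%, the bound is roughly 184 KB/s. Because the bound is inversely proportional to the RTT, an RTT estimate that is ten times too small makes the bound ten times too large, so a flow would have to be far more aggressive before the test flags it.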

While  this  approach  can  be  efficiently  implemented,  it  has  two  significant  drawbacks.  First, 
these  tests  are  generally  inaccurate  as  they  are  based  on  parameters  that  are  very  hard  to  estimate. 
For  example,  it  is  very  difficult  if  not  impossible  to  accurately  estimate  the  RTT  of  an  arbitrary  flow 
based  only  on  the  local  information  available  at  the  router,  as  assumed  by  Floyd  and  Fall  [36].  ^ 
Because  of  this,  current  proposals  simply  assume  that  the  RTT  is  twice  the  propagation  delay  on  the 
outgoing  link.  Clearly,  depending  on  the  router  position  on  the  path  of  the  flow,  this  procedure  can 
lead  to  major  under-estimations,  negatively  impacting  the  overall  accuracy  of  these  tests. 

Second, this approach makes the implicit assumption that all existing and future congestion
control algorithms are going to be TCP friendly. From an architectural standpoint, this assumption
considerably  reduces  the  freedom  of  designing  and  building  new  protocols.  This  can  have  significant 
implications,  as  the  freedom  allowed  by  the  original  datagram  service,  one  of  the  key  properties  that 
has  contributed  to  the  success  of  the  Internet,  is  lost. 

2.3.2.2  Allocation  Approach 

In  this  approach  routers  employ  special  mechanisms  that  allocate  bandwidth  in  a  fair  manner.  Fair 
bandwidth  allocation  protects  well-behaved  flows  from  ill-behaved  ones,  and  is  typically  achieved 
by  using  per-flow  queueing  mechanisms  such  as  Fair  Queueing  [31,  79]  and  its  many  variants  [10, 
45,  94]. 

Unlike  the  identification  approach,  the  allocation  approach  allows  various  congestion  policies 
to  coexist.  This  is  because  no  matter  how  much  traffic  a  source  will  send  in  the  network,  it  is  not 
going  to  get  more  than  its  fair  allocation.  Unfortunately,  this  flexibility  does  not  come  for  free.  Fair 
allocation  mechanisms  are  complex  to  implement,  as  they  inherently  require  routers  to  maintain 
state  and  perform  operations  on  a  per  flow  basis.  In  contrast,  with  the  identification  approach, 
routers  need  to  maintain  state  only  for  the  flows  which  are  punished,  i.e.,  the  ill-behaved  flows. 

^While a possible solution would be to have the end-hosts send the estimated RTT to routers along the flow’s path,
there are two problems with this approach. First, it requires that changes be made to the end-hosts, and second, there is
the question of whether a router can trust this information.
2.3.3 Integrated Services


As new applications such as IP telephony, video-conferencing, audio and video streaming, and
distributed games are deployed in the Internet, services more sophisticated than best effort are needed.
Unlike previous applications such as file transfer, these new applications have much stricter
timeliness and bandwidth requirements. For example, to enable natural interaction, the end-to-end delay
needs to be below human perception. Previous studies concluded that for natural hearing this delay
should be around 100 ms [64]. Since in a global network the propagation delay alone is about 100
ms, meeting such tight delay requirements is a very challenging task [7].

To support these new applications, the IETF has proposed a new service model called Integrated
Services or Intserv [82]. Intserv uses a flow abstraction. Two services were defined within the Intserv
framework: Guaranteed and Controlled-Load services.

2.3.3.1  Guaranteed  Service 

Guaranteed  service  is  the  strongest  semantic  service  proposed  in  the  context  of  the  Internet  so 
far  [93].  Guaranteed  service  has  the  ability  to  provide  per  flow  bandwidth  and  delay  guarantees. 
In  particular,  a  flow  can  be  guaranteed  a  minimum  bandwidth,  and,  given  the  arrival  process  of 
the flow, a maximum end-to-end delay. This way, Guaranteed service provides ideal support for
real-time applications such as IP telephony.

However,  this  comes  at  the  cost  of  a  significant  increase  in  complexity:  current  solutions  require 
routers  to  maintain  and  manage  per  flow  state  on  both  data  and  control  paths.  On  the  data  path,  a 
router  has  to  perform  per  flow  classification,  buffer  management  and  scheduling.  On  the  control 
path,  routers  have  to  maintain  per  flow  forwarding  state  and  perform  per  flow  admission  control. 
During  the  admission  control,  each  router  on  the  flow’s  path  reserves  network  resources,  such  as  the 
link  capacity  and  buffer  space,  to  make  sure  that  the  flow’s  bandwidth  and  delay  requirements  are 
met. 


2.3.3.2  Controlled-Load  Service 

For applications that do not require strict service guarantees, the IETF has proposed a weaker
semantic service within the Intserv framework: the Controlled-Load service. As defined by
Wroclawski [121], the Controlled-Load service “tightly approximates the behavior visible to
applications receiving best effort service *under unloaded conditions* from the same series of network
elements”. More precisely, the Controlled-Load service ensures that (1) the packet loss is not
significantly larger than the basic error rate of the transmission medium, and (2) the end-to-end delay
experienced by a very large percentage of packets does not greatly exceed the end-to-end
propagation delay. The Controlled-Load service is intended to provide better support for a broad class of
applications that have been developed for use in today’s Internet. Among the applications that fall
into this class are the “adaptive and real-time applications” such as video and audio streaming.

While the Controlled-Load service still requires routers to perform per flow admission control
on the control path, and packet classification, buffer management, and scheduling on the data path,
some of these operations can be significantly simplified. For example, the scheduling can be
implemented by a simple weighted round robin discipline, which has O(1) time complexity. Thus, the
Controlled-Load service trades a lower quality of service for a simpler implementation.
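
As a rough illustration of why weighted round robin is cheap, the sketch below visits the queues in a fixed cyclic order and serves up to `weight` packets from each, so choosing the next packet needs no sorting or searching. The function name, weights, and queue contents are hypothetical and not taken from any Controlled-Load specification.

```python
from collections import deque

def weighted_round_robin(queues, weights, rounds):
    """Serve up to weights[k] packets from queues[k] in each round.

    queues  -- list of deques holding packets (any objects)
    weights -- list of positive integers, one per queue
    rounds  -- number of full passes over all queues
    """
    sent = []
    for _ in range(rounds):
        for queue, weight in zip(queues, weights):
            for _ in range(weight):
                if queue:                      # skip empty queues
                    sent.append(queue.popleft())
    return sent

# Two hypothetical queues with weights 2 and 1.
a = deque(["a1", "a2", "a3", "a4"])
b = deque(["b1", "b2"])
order = weighted_round_robin([a, b], [2, 1], rounds=2)
```

In this example queue `a` receives twice as many transmission opportunities per round as queue `b`.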

In summary, although Intserv provides much more powerful and flexible services than today’s
Internet (services that would answer the needs of the new emerging applications), concerns with
respect to its complexity and scalability have hampered its adoption. In fact, except in small
testbeds, Intserv solutions have yet to be deployed.

2.3.4  Differentiated  Services 

To alleviate the scalability problems that have plagued Intserv, a new service model called
Differentiated Services (Diffserv) has recently been proposed [13, 75]. The Diffserv architecture
differentiates between edge and core routers. Edge routers maintain per flow or per aggregate state. Core
routers maintain state only for a very small number of traffic classes; they do not maintain any fine
grained state about the traffic. Each packet carries in its header a six bit field, called the
Differentiated Service (DS) field, which specifies the class to which the packet belongs. The DS field
is initialized by the ingress router upon the packet arrival. In turn, core routers use the DS field
to classify and process the packets. Since the number of classes at a core router is very small,
packet processing can be very efficiently implemented. This makes the Diffserv architecture highly
scalable.
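
The classification step at a core router can be sketched in a few lines. The sketch assumes the IPv4 layout defined in RFC 2474, in which the six bit code-point occupies the upper six bits of the former Type of Service octet; that detail is not stated in the text above. The class table is hypothetical: 46 and 10 are the standard EF and AF11 code-points, but mapping them to "premium" and "assured" classes is an assumption made for illustration.

```python
# Hypothetical mapping from code-point to a small set of traffic classes.
CLASS_OF_DSCP = {46: "premium", 10: "assured"}

def dscp_of(tos_octet: int) -> int:
    """Extract the six bit code-point from the upper bits of the octet."""
    return (tos_octet >> 2) & 0x3F

def set_dscp(tos_octet: int, dscp: int) -> int:
    """Write a code-point (done at the ingress), keeping the two low bits."""
    return ((dscp & 0x3F) << 2) | (tos_octet & 0x03)

def classify(tos_octet: int) -> str:
    """Core-router lookup: one table access, no per flow state."""
    return CLASS_OF_DSCP.get(dscp_of(tos_octet), "best-effort")
```

Because the lookup depends only on the packet header and a table whose size is the number of classes, its cost does not grow with the number of flows.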

Two  services  were  proposed  in  the  context  of  the  Diffserv  architecture:  Assured  and  Premium 
services. 


2.3.4.1 Assured Service

The Assured service [24, 55] is a large granularity service, that is, the service is associated with
the aggregate traffic of a customer from/to multiple hosts. The service contract between a customer
and the Diffserv network or ISP is called the service profile. A service profile is usually defined in
terms of absolute bandwidth and relative loss. As an example, an ISP can provide two service levels
(classes), silver and gold, where the gold service has the lower loss probability. A possible service
profile would offer transmission of 10 Mbps of a customer’s web traffic by using the silver service.

In the Assured service model, ingress routers perform three functions. They (a) monitor the
aggregate traffic from each user to make sure that no user exceeds its traffic profile, (b) downgrade
the user’s traffic to a lower service level if the user exceeds its profile, and (c) initialize the DS field
in the packet headers with the code-point associated with the service. Thus, ingress routers need to
keep state for each profile or user. In contrast, core routers do not need to keep such state, as their
function reduces to processing the packets based on the code-points carried by the packets.
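
The three ingress functions can be sketched with a token bucket meter. The text does not say how the profile is monitored, so the token bucket, the class name `ProfileMeter`, and the code-point labels used below are all assumptions made for illustration.

```python
class ProfileMeter:
    """Per-profile state kept at an ingress router (hypothetical sketch)."""

    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps        # profile bandwidth
        self.burst = burst_bits     # largest tolerated burst
        self.tokens = burst_bits    # bucket starts full
        self.last = 0.0             # time of the previous packet (seconds)

    def mark(self, now, packet_bits, in_profile_cp, out_profile_cp):
        """Return the code-point to write into the packet's DS field."""
        # (a) monitor: refill the bucket according to the elapsed time
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return in_profile_cp    # (c) mark with the contracted class
        return out_profile_cp       # (b) downgrade the excess traffic

meter = ProfileMeter(rate_bps=1000, burst_bits=1000)
marks = [meter.mark(t, 800, "gold", "silver") for t in (0.0, 0.1, 1.0)]
```

The second packet arrives before enough tokens have accumulated and is downgraded rather than dropped; the only state involved lives at the ingress.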

While  the  fixed  bandwidth  profile  makes  the  Assured  service  very  compelling,  it  also  makes  it 
very  challenging  to  implement.  This  is  due  to  a  fundamental  conflict  between  maximizing  resource 
utilization  and  achieving  high  service  assurance.  Since  a  service  profile  does  not  specify  how  the 
traffic  is  distributed  through  the  network,  the  network  has  to  make  conservative  assumptions  to 
achieve  high  service  assurance.  At  the  limit,  to  guarantee  zero  loss,  the  network  has  to  assume  that 
the  entire  assured  traffic  traverses  the  slowest  link  in  the  network!  Clearly,  such  an  assumption  leads 
to  a  very  low  resource  utilization,  which  can  be  unacceptable. 

An alternate approach is to define service profiles in relative rather than absolute terms. One such
example is the User-Share Differentiation (USD) approach [116]. With USD each user is assigned
a share (weight) that specifies the relative fraction of the capacity that the user is entitled to receive
on each link in the network. This is equivalent to a network in which the capacity of each link
is allocated by a Weighted Fair Queueing algorithm. The problem with such an approach is that
the core routers need to maintain per user state, which can negate the scalability advantage of the
Diffserv architecture. In addition, with USD, there is little correlation between the share of a user
and the aggregate throughput it will receive. For example, two users that are assigned the same share
can see drastically different aggregate throughputs. A user that has traffic for many destinations
(and whose traffic thus traverses many different paths) can potentially receive much higher aggregate
throughput than a user that has traffic for only a few destinations.

2.3.4.2 Premium Service

Unlike  the  Assured  service  which  can  be  associated  with  an  aggregated  traffic  to/from  multiple 
hosts,  Premium  service  provides  the  equivalent  of  a  dedicated  link  of  fixed  bandwidth  between  two 
edge  routers  [60].  To  implement  this  service,  the  network  has  to  perform  admission  control.  The 
current  proposals  assume  a  centralized  architecture:  each  domain  is  associated  with  a  database, 
called  Bandwidth  Broker  (BB),  that  has  complete  knowledge  about  the  entire  domain.  To  set  up  a 
flow across a domain, the domain’s BB checks first whether there are enough resources between the
two  end  points  of  the  flow  across  the  domain.  If  yes,  the  request  is  granted  and  the  BB’s  database  is 
updated  accordingly. 

On the data path, ingress routers perform two functions. They (a) shape the traffic associated
with a service profile, that is, make sure that the traffic does not exceed the profile by delaying the
excess packets^, and (b) insert the Premium service code-point in the DS field. In turn, core routers
forward the premium traffic with high priority.

As  a  result,  the  Premium  service  can  provide  effective  support  for  real-time  traffic.  A  natural 
question  to  ask  is  what  is  the  difference  between  the  Premium  service  and  the  Guaranteed  service 
proposed  by  Intserv.  Though  at  the  surface  they  are  quite  similar,  there  are  two  important  differences 
between  them. 

First, while the Guaranteed service can provide both per flow bandwidth and delay
differentiation, the Premium service can provide only per flow bandwidth differentiation. This is because core
routers do not differentiate between premium packets on a per flow basis: all premium packets are
simply processed in FIFO order. Thus, the only possibility to meet different delay requirements
for different flows is to guarantee the smallest delay required by any flow to all flows. Unfortunately,
this can result in very low resource utilization for the premium traffic. In particular, as shown by
Stoica and Zhang [105], even if the fraction that can be allocated to premium traffic on every link in
the network is very low (e.g., 10%), the worst case queueing delay across a large network (e.g., 15
routers) can be relatively large (e.g., 240 ms). In contrast, Intserv can achieve both higher resource
utilization and tighter delay bounds, by better matching flow requirements to resource usage.

Second, the centralized bandwidth broker architecture proposed to perform admission control
in the case of the Premium service is adequate only for coarse grained flows that are active over
long time scales. In contrast, because the Guaranteed service uses a distributed admission control
architecture, it can support fine grained reservations over small time scales.

^Note that this is different from the Assured service, where the excess traffic is let into the network, but its priority is
downgraded.

The  price  paid  by  the  Guaranteed  service  is  again  complexity.  Unlike  the  Premium  service,  the 
Guaranteed  service  requires  routers  to  maintain  per  flow  state  on  both  the  data  and  the  control  paths. 

2.4  Summary 

In  the  first  part  of  this  chapter,  we  have  discussed  the  IP  network  model.  In  particular  we  have 
presented  the  router  architecture  and  discussed  the  implementation  complexities  of  both  the  data 
and  control  paths. 

In  the  second  part  of  this  chapter,  we  have  presented  the  best-known  proposals  to  improve  the 
best  effort  service  in  today's  Internet:  (a)  flow  protection  to  provide  effective  support  for  congestion 
control,  (b)  Integrated  Services  (Intserv)  model,  and  (c)  Differentiated  Services  (Diffserv)  model. 
Of  all  these  models,  only  Diffserv  admits  a  known  scalable  implementation,  as  core  routers  are 
not  required  to  maintain  any  per  flow  state.  However,  to  achieve  this,  Diffserv  makes  significant 
compromises. In particular, the Assured service cannot achieve simultaneously high service
assurance and high resource utilization. Similarly, the Premium service cannot provide per flow delay
differentiation,  and  it  is  not  adequate  for  fine  grained  and  short  term  reservations. 

In this dissertation we address these shortcomings by developing a novel solution that can
implement all of the above per flow services (i.e., flow protection, guaranteed and controlled-load
services) in the Internet without compromising its scalability. In the next chapter, we present an
overview of our solution.
Chapter 3


Overview 


The  main  contribution  of  this  dissertation  is  to  provide  the  first  solution  that  makes  it  possible  to 
implement  services  as  powerful  and  as  flexible  as  the  ones  implemented  by  a  stateful  network  using 
a  stateless  core  network  architecture.  In  this  chapter,  we  give  an  overview  of  our  solution  and  present 
a  perspective  of  how  this  solution  compares  to  the  two  most  prominent  solutions  proposed  in  the 
literature to provide better services in the Internet: Integrated Services and Differentiated Services.

The chapter is organized as follows. Section 3.1 describes the main components of our solution
and uses three examples to illustrate the ability of our key technique, called Dynamic Packet State
(DPS), to provide per flow functionalities in a stateless core network. Section 3.2 briefly describes
our implementation prototype, and gives a simple example to illustrate the capabilities of our
implementation. Section 3.3 presents a comparison between our solution and the two main network
architectures proposed by the Internet Engineering Task Force (IETF) to provide more sophisticated
services in the Internet: Integrated Services and Differentiated Services. Finally, Section 3.4
summarizes our findings.


3.1  Solution  Overview 

This  section  presents  the  three  main  components  of  our  solution.  Section  3.1.1  defines  the  Stateless 
Core  (SCORE)  network  architecture,  which  represents  the  basic  building  block  of  our  solution. 
Section  3.1.2  presents  a  novel  approach  that  allows  us  to  emulate/approximate  the  service  provided 
by  a  stateful  network  with  a  SCORE  network.  Section  3.1.3  describes  the  key  technique  we  use 
to implement this approach: Dynamic Packet State (DPS). To illustrate this technique we sketch
how it can be used to implement three per flow mechanisms in a SCORE network: (1) approximating
the Fair Queueing scheduling discipline, (2) providing per flow admission control, and (3) performing
route pinning.


3.1.1  The  Stateless  Core  (SCORE)  Network  Architecture 

The  basic  building  block  of  our  solution  is  called  Stateless  Core  (SCORE).  Similar  to  a  Diffserv 
domain,  a  SCORE  domain  is  a  contiguous  and  trusted  region  of  network  in  which  only  edge  routers 
maintain per flow state, while core routers maintain no per flow state (see Figure 1.1(b)). Since
edge  routers  usually  run  at  much  lower  speeds  and  handle  fewer  flows  than  the  core  routers,  this 
architecture  is  highly  scalable. 
3.1.2 The “State-Elimination” Approach


Our ultimate goal is to provide better services in today's Internet without compromising its
scalability and robustness. To achieve this goal, we propose a two step approach, called the “State-Elimination”
approach (see Figure 1.1). In the first step we define a reference stateful network that provides the
desired service. In the second step, we try to approximate or, if possible, to emulate the service
provided by the reference stateful network in a SCORE network. In this way we are able to
provide services as powerful and as flexible as the ones implemented by stateful networks in a mostly
stateless network, i.e., in a SCORE network. In Chapters 4, 5 and 6, we illustrate this approach
by considering three of the most important services proposed to enhance today's Internet: flow
protection, guaranteed services, and relative service differentiation.

We note that similar approaches have been proposed in the literature to approximate the
functionality of an idealized router that implements a bit-by-bit round-robin scheduling discipline with
a stateful router that forwards traffic on a per packet basis [10, 31, 45, 79]. However, our approach
differs from these approaches in two significant aspects. First, the state-elimination approach is
concerned with emulating the functionality of an entire network, rather than of a single router. Second,
unlike previous approaches that aim to approximate an idealized system with a stateful system, our
goal is to approximate the functionality of a stateful system with a stateless core system.

3.1.3  The  Dynamic  Packet  State  (DPS)  Technique 

DPS  is  the  key  technique  that  allows  us  to  implement  the  above  services  in  a  SCORE  network.  The 
main  idea  behind  DPS  is  very  simple:  instead  of  having  routers  install  and  maintain  per  flow  state, 
have  packets  carry  the  per  flow  state.  This  state  is  inserted  by  ingress  routers,  which  maintain  per 
flow state. In turn, a core router processes each incoming packet based on (1) the state carried in
the packet’s header, and (2) the router’s internal state. Before forwarding the packet to the next hop,
the core router updates both its internal state and the state in the packet’s header (see Figure 1.2).
By  using  DPS  to  coordinate  actions  of  edge  and  core  routers  along  the  path  traversed  by  a  flow, 
distributed  algorithms  can  be  designed  to  approximate  the  behavior  of  a  broad  class  of  stateful 
networks  using  networks  in  which  core  routers  do  not  maintain  per  flow  state. 

To give an intuition of how the DPS technique works, we next present three examples in which
DPS is used to (1) approximate the Fair Queueing algorithm, (2) estimate the aggregate reservation
for admission control purposes, and (3) bind a flow to a particular path (i.e., perform route pinning).

3.1.3.1  Example  1:  Fair  Bandwidth  Allocation 

Flow protection is one of the most desirable enhancements of today’s best effort service. Flow
protection allows diverse end-to-end congestion control schemes to seamlessly coexist in the Internet,
and protects well behaved traffic against malicious or ill behaved traffic. The solution of choice to
achieve flow protection is to have routers implement fair bandwidth allocation [31]. In an idealized
system in which a router can provide services at the bit granularity, fair bandwidth allocation can be
achieved by using a bit-by-bit round robin discipline.

For clarity, consider three flows with arrival rates of 8, 6, and 2 bits per second (bps),
respectively, that share a 10 bps link. Assume that the traffic of each flow arrives one bit at a time,
and that it is periodic with period $1/r$, where $r$ is the flow rate. Thus, during one second, exactly 16
bits are received, and exactly 10 bits can be transmitted. During each round, the scheduler transmits
exactly one bit from every flow that has a packet to send. Since in the worst case flow 3 is visited
once every 3/10 sec, and it has an arrival rate of only one bit every 0.5 sec, it follows that all of
its traffic is served. This leaves the other two flows to share the remaining 8 bps of the link capacity.
Since the arrival rates of both flows 1 and 2 are larger than half of the remaining capacity, each flow will
receive half of it, i.e., 4 bps. As a result, under the bit-by-bit round robin discipline, the three flows
are allocated bandwidth of 4, 4, and 2 bps, respectively. The maximum rate allocated to a flow on a
congested link is called the fair rate. In this example the fair rate is 4.

Figure 3.1: Example illustrating the CSFQ algorithm at a core router. An output link with a capacity of
10 is shared by three flows with arrival rates of 8, 6, and 2, respectively. The fair rate of the output link in
this case is $\alpha = 4$. Each arriving packet carries in its header the rate of the flow it belongs to. According to
Eq. (3.3) the dropping probability for flow 1 is 0.5, while for flow 2 it is 0.33. Dropped packets are indicated
by crosses. Before forwarding a packet, its header is updated to reflect the change in the flow's rate due to
packet dropping (see the packets at the right-hand side of the router).

In general, given $n$ flows that traverse a congested link of capacity $C$, the fair rate $\alpha$ is defined
such that

$$\sum_{i=1}^{n} \min(r_i, \alpha) = C, \tag{3.1}$$

where $r_i$ represents the arrival rate of flow $i$. By applying this formula to the previous example, we
have $\min(8, \alpha) + \min(6, \alpha) + \min(2, \alpha) = 10$, which gives us $\alpha = 4$. If the link is not congested,
that is, if $\sum_{i=1}^{n} r_i < C$, the fair rate $\alpha$ is by convention defined as the maximum among all
arrival rates.
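
Eq. (3.1) can be solved directly when all arrival rates are known. The sketch below does so with a simple "water-filling" pass over the sorted rates; it is an illustration of the definition only, not the per flow stateless estimation that core routers use, and the function names are hypothetical. Treating the boundary case in which the rates sum exactly to the capacity as uncongested is also an assumption.

```python
def fair_rate(rates, capacity):
    """Return alpha such that sum(min(r, alpha) for r in rates) == capacity.

    If the link is not congested, return max(rates) by convention.
    """
    if sum(rates) <= capacity:
        return max(rates)
    remaining = capacity
    pending = sorted(rates)
    n = len(pending)
    for k, r in enumerate(pending):
        share = remaining / (n - k)   # equal split among unsatisfied flows
        if r <= share:
            remaining -= r            # flow k is fully served
        else:
            return share              # all remaining flows are capped here
    return max(rates)                 # not reached when the link is congested

def allocations(rates, capacity):
    """Service rate min(r_i, alpha) received by each flow."""
    alpha = fair_rate(rates, capacity)
    return [min(r, alpha) for r in rates]
```

For the example above, `fair_rate([8, 6, 2], 10)` evaluates to 4 and `allocations([8, 6, 2], 10)` to 4, 4, and 2.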

Thus, with bit-by-bit round robin, the service rate allocated to a flow $i$ with arrival rate $r_i$ is

$$\min(r_i, \alpha). \tag{3.2}$$
The first algorithm to approximate the bit-by-bit round robin in a packet system was proposed
by Demers et al. [31], and it is called Fair Queueing. Eq. (3.2) directly illustrates the protection
property of Fair Queueing, that is, a flow with enough demand is guaranteed to receive its fair rate
$\alpha$, irrespective of the behavior of the other flows. To put it another way, a flow cannot deny service
to other flows because no matter how much and what type of traffic it pumps into the network, it
will not get more than $\alpha$ on the congested link.

While Fair Queueing can fully provide flow protection, it is more complex to implement than
traditional FIFO queueing with drop-tail, which is the most widely implemented and deployed
mechanism in routers today. For each packet that arrives at the router, the router needs to classify the
packet into a flow, update per flow state variables, and perform per flow scheduling.

Our  goal  is  to  eliminate  this  complexity  from  the  network  core  by  using  a  SCORE  network 
architecture  to  approximate  the  functionality  of  a  reference  network  in  which  every  router  performs 
Fair  Queueing.  In  the  following  sections,  we  describe  a  DPS  based  algorithm,  called  Core-Stateless 
Fair  Queueing  (CSFQ)  that  achieves  this  goal. 

The key idea of CSFQ is to have each packet carry the rate estimate of the flow it belongs to.
Let $r_i$ denote the rate estimate carried by a packet of flow $i$. The rate estimate is computed by edge
routers and then inserted in the packet header. Upon receiving the packet, a core router forwards it
with the probability

$$p = \min\left(1, \frac{\alpha}{r_i}\right), \tag{3.3}$$

and drops it with the probability $1 - p$.

It is easy to see that by forwarding each packet with the probability $p$, the router effectively
allocates to flow $i$ a rate $r_i \times p = \min(r_i, \alpha)$, which is exactly the rate the flow would receive
under Fair Queueing (see Eq. (3.2)). If $p < 1$, the router also updates the packet label to $\alpha$. This is
to reflect the fact that when the flow's arrival rate is larger than $\alpha$, the flow's rate after traversing the
link drops to $\alpha$ (see Figure 3.1).

It is also easy to see that with CSFQ core routers do not require any per flow state. Upon packet
arrival, a core router needs to compute only the dropping probability, $1 - p$, which depends exclusively
on the estimated rate carried by the packet, and the fair rate $\alpha$ that is locally computed by the router.
(In Chapter 4, we show that computing $\alpha$ does not require per flow state either.)

Figure 3.1 shows an example in which three flows with incoming rates of 8, 6, and 2,
respectively, share a link of capacity 10. Without going into details, we note that in this case $\alpha = 4$. Then,
from Eq. (3.3), it follows that the forwarding probabilities of the three flows are 0.5, 0.66, and 1,
respectively. As a result, on the average, one out of two packets of flow 1, one out of three packets
of flow 2, and no packets of flow 3, are dropped. Note that before forwarding the packets of flows 1
and 2, the router updates the rate estimates in their headers to 4. This is to reflect the change of the
flow rates as a result of packet dropping.
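
The per packet work of a core router in this example can be sketched as follows. The fair rate is passed in as a given value (how it is computed without per flow state is deferred to Chapter 4), and the function names are hypothetical.

```python
import random

def forwarding_probability(label_rate, alpha):
    """Eq. (3.3): p = min(1, alpha / r_i), using only the packet's label."""
    return min(1.0, alpha / label_rate)

def core_forward(label_rate, alpha, rng=random):
    """Process one packet at a core router.

    Returns the label to write into the forwarded packet, or None if the
    packet is dropped. No per flow state is read or written.
    """
    p = forwarding_probability(label_rate, alpha)
    if rng.random() >= p:
        return None                    # dropped with probability 1 - p
    return min(label_rate, alpha)      # relabel to alpha when p < 1

# Flows of Figure 3.1: labels 8, 6, and 2 on a link whose fair rate is 4.
probabilities = [forwarding_probability(r, 4) for r in (8, 6, 2)]
```

The decision uses only the label carried by the packet and the locally known fair rate, which is why the same code serves every flow.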

3.1.3.2  Example  2:  Per  Flow  Admission  Control 

In this example we consider the problem of performing per flow admission control. The role of
the admission control is to check whether there are enough resources on the data path to grant a
reservation request. For simplicity, we assume that admission control is limited to bandwidth. When
a new flow makes a reservation request, each router on the path from source to destination checks
whether it has enough bandwidth to accommodate the new flow. If all routers can accommodate the
flow, then the reservation is granted.

Figure 3.2: Example illustrating the estimation of the aggregate reservation. Two flows with reservations
of 5 and 2, respectively, share a common link. Ingress routers initialize the header of each packet according
to Eq. (3.4). The aggregate reservation is estimated as the ratio between the sum of the values carried in
the headers of the packets received during an averaging time interval and the length T of that interval. In
this case the estimated reservation is 65/10 = 6.5.

It  is  easy  to  see  that  to  decide  whether  a  new  reservation  request  rsv  can  be  accepted  or  not, 
a  router  needs  only  to  know  the  current  aggregate  reservation,  R,  on  the  output  link,  that  is,  how 
much  bandwidth  it  has  reserved  so  far.  In  particular,  if  the  capacity  of  the  output  link  is  C,  then  the 
router can accept a reservation, rsv, as long as rsv + R < C. Unfortunately, it turns out that
maintaining the aggregate reservation, R, in the presence of packet loss and partial reservation failures
is  not  trivial.  Intuitively,  this  is  because  the  admission  control  needs  to  implement  transaction-like 
semantics.  A  reservation  is  granted  if  and  only  if  all  routers  along  the  path  accept  the  reservation.  If 
a  router  cannot  accept  a  reservation,  then  all  routers  that  have  accepted  the  reservation  have  to  roll 
back  to  the  previous  state,  so  they  need  to  remember  that  state.  Similarly,  if  a  reservation  request 
message  is  lost,  and  the  request  is  resent,  then  a  router  has  to  remember  whether  it  has  received 
the  original  request,  and  if  yes,  whether  the  request  was  granted  or  denied.  For  all  these  reasons, 
the  current  proposed  solutions  for  admission  control  such  as  RSVP  [128]  and  ATM  UNI  [1]  require 
routers  to  maintain  per  flow  state. 
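
The transaction-like semantics described above can be sketched as follows. The `Link` class and the `admit` function are hypothetical, and the sketch runs in one place for brevity; in a real network each router would have to remember on its own which requests it has accepted, and that memory is the per flow state. The text writes the local test as rsv + R < C; admitting the equality case as well is an assumption of this sketch.

```python
class Link:
    """Output link with capacity C and current aggregate reservation R."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = 0.0

def admit(path, rsv):
    """Grant rsv only if every link on the path can accept it."""
    accepted = []                      # links that have said yes so far
    for link in path:
        if link.reserved + rsv <= link.capacity:
            link.reserved += rsv
            accepted.append(link)
        else:
            for earlier in accepted:   # roll back the partial reservation
                earlier.reserved -= rsv
            return False
    return True

# Hypothetical two-hop path with capacities 10 and 6.
path = [Link(10), Link(6)]
first = admit(path, 5)     # fits on both links
second = admit(path, 2)    # fails on the second link, first link rolls back
```

After the failed request the first link's reservation is back to 5, which is only possible because the links accepted so far were remembered.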

In  the  remainder  of  this  section  we  show  that  by  using  DPS  it  is  possible  to  perform  admission 
control  in  a  SCORE  network,  that  is,  without  core  routers  maintaining  any  per  flow  state. 

At  the  basis  of  our  scheme  lies  a  simple  observation:  if  all  flows  were  sending  at  their  reserved 
rates,  then  it  is  trivial  to  maintain  the  aggregate  reservation  R;  each  router  only  needs  to  measure 
the  rate  of  the  aggregate  traffic.  Consider  the  example  in  Figure  3.2,  and  assume  that  flow  1  has 
a  reservation  of  5  Kbps,  and  flow  2  has  a  reservation  of  2  Kbps.  If  the  two  flows  were  sending 
exactly at their reserved rates, i.e., flow 1 at 5 Kbps, and flow 2 at 2 Kbps, the highlighted router (see
Figure 3.2) can simply estimate the aggregate reservation of 7 Kbps by measuring the rate of the
aggregate arrival traffic.

The  obvious  problem  with  the  above  scheme  is  that  most  of  the  time  flows  do  not  send  at  their 
reserved rates. To address this problem, we associate a virtual length with each packet. The virtual
length is such that if the lengths of all packets of a flow were equal to their virtual lengths, then the
flow would send at its reserved rate. More precisely, the virtual length of a packet represents the amount of
traffic that the flow was entitled to send according to its reserved rate since the previous packet was
transmitted. Let $rsv_i$ denote the reservation of flow i, and let $t_i^j$ and $t_i^{j+1}$ denote the departure
times of the j-th and (j+1)-th packets of flow i. Then the (j+1)-th packet will carry in its header
a virtual length

$$vl_i^{j+1} = rsv_i \times (t_i^{j+1} - t_i^j). \tag{3.4}$$


The  virtual  length  of  the  first  packet  of  the  flow  is  simply  the  actual  length  of  the  packet.  The 
virtual  length  is  computed  and  inserted  in  the  packet  header  by  the  ingress  router  upon  the  packet 
departure.  In  turn,  core  routers  use  the  packet  virtual  lengths  to  estimate  the  aggregate  reservation. 
For  illustration,  consider  again  the  example  in  Figure  3.2,  where  flow  1  has  a  reservation  of  5 
Kbps. For the purpose of this example, we neglect the delay jitter, and assume that no packets are
dropped inside the core. Suppose the inter-arrival times of the first four packets of flow 1 are 2 sec,
3 sec, and 4 sec, respectively. Since in this case the packet inter-arrival times at core routers are
equal to the packet inter-departure times at the ingress, according to Eq. (3.4), the 2nd, 3rd, and 4th
packets of flow 1 will carry in their headers $vl_1^2 = rsv_1 \times 2 = 10$ Kb, $vl_1^3 = rsv_1 \times 3 = 15$ Kb, and
$vl_1^4 = rsv_1 \times 4 = 20$ Kb, respectively.
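
The ingress computation can be sketched as follows. This is a minimal illustration, assuming the ingress keeps one small record per flow; the class name `IngressFlowState`, its fields, and the 8000-bit length of the first packet are hypothetical choices made for the example, not details taken from the prototype.

```python
class IngressFlowState:
    """Per flow state kept at the ingress router only."""

    def __init__(self, rsv_bps):
        self.rsv_bps = rsv_bps          # reservation of the flow, in bits/sec
        self.last_departure = None      # departure time of the previous packet

    def virtual_length(self, departure_time, actual_length_bits):
        if self.last_departure is None:
            # First packet of the flow: virtual length = actual length.
            vl = actual_length_bits
        else:
            # Traffic the flow was entitled to send since the previous packet.
            vl = self.rsv_bps * (departure_time - self.last_departure)
        self.last_departure = departure_time
        return vl


# Flow 1 of the example: 5 Kbps reservation, inter-departure times 2, 3, 4 sec.
flow1 = IngressFlowState(rsv_bps=5000)
departures = [0, 2, 5, 9]
virtual_lengths = [flow1.virtual_length(t, actual_length_bits=8000) for t in departures]
```

With these inputs the last three virtual lengths are 10000, 15000, and 20000 bits, matching the 10 Kb, 15 Kb, and 20 Kb of the example.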

Next, note that the sum of the virtual lengths, $B_i(T)$, of all packets of flow i that arrive at a core
router during an interval of length T provides a fair approximation of the amount of traffic that the
flow is entitled to send during time T at its reserved rate. Then, the reserved bandwidth of flow i
can be estimated as

$$R_i = \frac{B_i(T)}{T}. \tag{3.5}$$


By extrapolation, a core router can estimate the aggregate reservation R on the outgoing link
by simply computing $B(T)/T$, where $B(T)$ represents the sum of the virtual lengths of all packets
that arrive during an interval of length T. Finally, it is worth noting that to perform this computation
core routers do not need to maintain any per flow state; they just need to maintain a global variable,
$B(T)$, that is updated every time a new packet arrives.

In the example shown in Figure 3.2, assume an averaging interval T = 10 sec (represented by
the shaded area). Then we have $B(T) = B_1(T) + B_2(T) = 65$ Kb, which gives us an estimate of
the aggregate reservation of $R = B(T)/T = 6.5$ Kbps, which is “reasonably” close to the actual
aggregate reservation R = 7 Kbps.
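
The core-router side of this estimate needs only one counter per output link. The sketch below is an illustration under stated assumptions: the class name `AggregateReservationEstimator` is hypothetical, and the individual virtual lengths fed to it are invented values chosen only so that they sum to the 65 Kb of the example (the per packet values of Figure 3.2 are not reproduced here).

```python
class AggregateReservationEstimator:
    """Core-router estimate of the aggregate reservation, R = B(T) / T."""

    def __init__(self, interval_sec):
        self.interval_sec = interval_sec   # averaging interval T
        self.sum_virtual_lengths = 0       # B(T): the only state kept per link

    def on_packet(self, virtual_length):
        # No flow lookup: every packet just adds its virtual length.
        self.sum_virtual_lengths += virtual_length

    def end_of_interval(self):
        estimate = self.sum_virtual_lengths / self.interval_sec
        self.sum_virtual_lengths = 0       # start a new averaging interval
        return estimate


estimator = AggregateReservationEstimator(interval_sec=10)
for vl_kb in [10, 15, 20, 8, 12]:          # hypothetical virtual lengths, in Kb
    estimator.on_packet(vl_kb)
estimate_kbps = estimator.end_of_interval()  # 65 Kb / 10 sec
```

Because the counter is shared by all flows, its cost is independent of the number of flows that traverse the link.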

In  Chapter  5  we  derive  an  upper  bound  of  the  aggregate  reservation  along  a  link,  instead  of  just 
an  estimate.  By  using  the  upper  bound  we  can  guarantee  that  the  link  is  never  over-provisioned. 


3.1.3.3  Example  3:  Route  pinning 

Many applications such as traffic engineering and guaranteed services require all packets of
a flow to follow the same path. To achieve this goal, many solutions such as Tenet RCAP [8]
and RSVP [128] require routers to maintain per flow state. In this section we present an alternate
solution based on DPS in which core routers do not need to maintain any per flow state.

Figure 3.3: Example illustrating the route pinning algorithm. Each packet contains in its header the path’s
label, defined as the xor over the identifiers of all routers on the remaining path to the egress. Upon packet
arrival, the packet’s header is updated to the label of the remaining path. The routing decisions are exclusively
based on the packet’s label (here the labels are assumed to be unique).

The key idea is to label a path by xor-ing the identifiers of all routers along the path. Consider
a path $id_0, id_1, \ldots, id_n$, where $id_j$ represents the identifier of the j-th router along the path. The
label l of this path at router $id_0$ is then

$$l = id_1 \oplus id_2 \oplus \ldots \oplus id_n. \tag{3.6}$$

In the example in Figure 3.3, the label of flow 1 that enters the network at the ingress router
0010, and traverses routers 1100, 1011 and 0011 is simply $1100 \oplus 1011 \oplus 0011 = 0100$.

The DPS algorithm in this case is as follows: Each ingress router maintains a label l for every
flow that traverses it. Upon packet arrival, ingress routers insert the label in the packet header.*
Upon receiving a packet, a core router recomputes the label of the remaining path by xor-ing the
label carried by the packet with its identifier. For example, when the first core router, identified by
$id_1$, receives a packet with label l, it recomputes a new label as $l = l \oplus id_1$. Note that by doing
so the new label represents exactly the identifier of the remaining path, i.e., $id_2 \oplus id_3 \oplus \ldots \oplus id_n$.
Finally, the core router updates the label in the packet header, and uses the resulting label to forward
the packet. Thus, a core router is not required to maintain per flow state, as it forwards each packet
based on the label in its header.

*Note that for simplicity, we do not present here how ingress routers obtain these labels. Also, we assume that path
labels are unique, and therefore the routing decisions can be exclusively based on the path label. Finally, we do not
discuss the impact of our scheme on address aggregation. We remove all these limitations in Chapter 6.

Figure 3.3 gives an example of three flows that arrive at core router 1100, and exit the network
through the same egress router 0011. However, while flows 1 and 2 are routed on identical paths,
flow 3 is routed on a different path. When a packet arrives, the label in the packet header is updated
by xor-ing it with the router identifier. Subsequently, the new label is used to route the packet.
Note that only one entry is maintained for both flows 1 and 2.
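
The label computation and the per packet update at a core router can be sketched as below. The helper names `path_label` and `core_forward`, and the dictionary used as a forwarding table, are hypothetical; the router identifiers are those of flow 1 in Figure 3.3, and labels are assumed to be unique, as in the text.

```python
from functools import reduce
from operator import xor


def path_label(router_ids):
    """Xor of the identifiers of all routers after the ingress (Eq. 3.6)."""
    return reduce(xor, router_ids, 0)


def core_forward(router_id, forwarding_table, label):
    """Update the label at a core router and look up the next hop.

    The table is keyed by the label of the remaining path, so flows that
    share the rest of their path share a single entry.
    """
    new_label = label ^ router_id
    return new_label, forwarding_table[new_label]


# Flow 1: ingress 0010, then routers 1100, 1011, and 0011.
label = path_label([0b1100, 0b1011, 0b0011])

# Hypothetical table at core router 1100: remaining-path label -> next hop.
table_1100 = {0b1000: 0b1011}
new_label, next_hop = core_forward(0b1100, table_1100, label)

print(format(label, "04b"), format(new_label, "04b"), format(next_hop, "04b"))
# prints "0100 1000 1011"
```

The new label 1000 equals 1011 xor 0011, that is, the label of the path that remains after router 1100.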


3.2  Prototype  Implementation 

To  demonstrate  that  it  is  possible  to  efficiently  implement  and  deploy  our  solutions  in  today's  IPv4 
networks,  we  have  developed  a  prototype  implementation  in  FreeBSD  v2.2.6.  The  prototype  fully 
implements  the  guaranteed  service  as  described  in  Chapter  5,  arguably  the  most  complex  of  all  the 
solutions we describe in this dissertation. Without going into detail, we note that our solution to provide
guaranteed services tries to closely approximate an idealized model in which each guaranteed
flow traverses dedicated links of capacity r, where r is the flow reservation. Thus, in the idealized
system,  a  flow  with  a  reservation  of  1  Mbps  behaves  as  if  it  is  the  only  flow  in  a  network  in  which 
all  links  are  of  1  Mbps. 

The  prototype  runs  in  a  test-bed  consisting  of  300  MHz  and  400  MHz  Pentium  II  PCs  connected 
by  point-to-point  100  Mbps  Ethernets.  The  test-bed  allows  the  configuration  of  a  path  with  up  to  two 
core  routers.  Although  we  had  complete  control  of  our  test-bed,  and,  due  to  resource  constraints,  the 
scale  of  our  experiments  was  rather  small  (e.g.,  the  largest  experiment  involved  just  100  flows),  we 
have  devoted  special  attention  to  making  our  implementation  as  general  as  possible.  For  example, 
while  in  the  current  implementation  we  re-use  protocol  space  in  the  IP  header  to  store  the  DPS 
state,  we  make  sure  that  the  modified  fields  can  be  fully  restored  by  the  egress  router.  In  this  way, 
the  changes  operated  by  the  ingress  and  core  routers  on  the  packet  header  are  completely  transparent 
to  the  outside  world.  Similarly,  while  the  limited  scale  of  our  experiments  would  have  allowed  us 
to use simple data structures to implement our algorithms, we went to great lengths to make sure that
our  implementation  is  scalable.  For  example,  instead  of  using  a  simple  linked  list  to  implement  the 
packet  scheduler,  we  use  a  calendar  queue  together  with  a  two-level  priority  queue  to  efficiently 
handle  a  very  large  number  of  flows  (see  Section  8.1). 

For debugging and management purposes, we implemented full support for packet level monitoring.
This allows us to visualize simultaneously and in real time the throughputs and the delays
experienced by flows at different points in the network. A key challenge when implementing such
fine grained monitoring functionality is to minimize the interference with the system operations.
We use two techniques to address this challenge. First, we off-load as much of the processing
of log data as possible to an external machine. Second, we use raw IP to send the log data directly from
the router’s kernel to the external machine. This way, we avoid context-switching between the kernel
and the user level.

To easily configure our system, we have implemented a command line configuration tool. This
tool allows us to (1) configure routers as ingress, egress, or core, (2) set up, modify, and tear down
a reservation, and (3) set up the monitoring parameters. To minimize the interference between
the configuration operations and data processing, we implement our tool on top of the Internet
Control Message Protocol (ICMP). Again, by using ICMP, we avoid context-switching when
configuring a router.

3.2.1  An  Example 

To illustrate how our entire package works, in this section we present a simple example. We
consider three flows traversing a three hop path in our test-bed (see Figure 3.4). The first router
on the path (i.e., aruba.cmcl.cs.cmu.edu) is configured as an ingress router, while the next
router (i.e., cozumel.cmcl.cs.cmu.edu) is configured as a core router. The link between the
two routers is configured to 10 Mbps. The traffic of each flow is generated by a different end-host
to eliminate potential interference. All flows are UDP, and send 1000 byte data packets. Flows
1 and 2 are guaranteed, while flow 3 is best-effort. More precisely,

• Flow 1 is a constant bit rate (CBR) flow with an arrival rate of 1 Mbps, and a reservation of 1
Mbps.

•  Flow  2  is  ON-OFF,  with  the  ON  and  OFF  periods  of  10  sec  each.  During  the  ON  period  the 
flow  sends  at  3  Mbps,  while  during  the  OFF  period  the  flow  does  not  send  anything.  The  flow 
has  a  reservation  of  3  Mbps. 

•  Flow  3  is  CBR  with  an  arrival  rate  of  approximately  8  Mbps.  Unlike  flows  1  and  2,  this  flow 
is  best-effort,  i.e.,  it  does  not  have  any  reservation. 

Note  that  when  all  flows  are  active,  the  total  offered  load  is  about  12  Mbps,  which  exceeds 
the  link  capacity  by  2  Mbps.  As  a  result,  during  these  time  periods  the  ingress  router  is  heavily 
congested. 

To observe the behavior of our implementation during this experiment, we use an external machine
(i.e., an IBM ThinkPad 560E notebook) to monitor the three flows at the end-points of the
congested  link:  aruba  and  cozumel.  Figure  3.5  shows  a  screen  snapshot  of  our  monitoring  tool 
that  plots  the  arrival  rates  and  the  delays  experienced  by  the  three  flows  at  aruba,  and  cozumel, 
respectively,  over  a  56  sec  time  interval.  The  top-left  plot  shows  the  arrival  rates  of  the  three  flows 
at  aruba,  while  the  top-right  plot  shows  their  arrival  rates  at  cozumel.  All  rates  represent  averages 
over  a  200  ms  time  period.  As  expected,  flow  1,  which  has  a  reservation  of  1  Mbps,  and  sends  traffic 
at  1  Mbps,  gets  all  its  traffic  through  the  congested  link.  This  is  illustrated  by  the  straight  line  at  1 
Mbps  that  appears  in  both  plots.  The  same  is  true  for  flow  2;  whenever  it  sends  at  3  Mbps  it  gets 
its  reservation.  That  is  why  the  arrival  rate  of  flow  2  looks  identical  in  the  two  plots.  In  contrast, 
as  shown  in  the  top-right  plot,  flow  3  gets  its  service  only  when  the  link  is  uncongested,  i.e.,  when 
flow  2  does  not  send  anything.  This  is  because  flow  3  is  best-effort,  and  therefore  when  both  flows 
1  and  2  fully  use  their  reservations,  flow  3  gets  only  the  remaining  bandwidth,  which  in  this  case  is 
about  6  Mbps. 

The  bottom-left  and  the  bottom-right  plots  in  Figure  3.5  show  the  delays  experienced  by  each 
flow  at  aruba,  and  cozumel,  respectively.  Each  data  point  represents  the  maximum  delay  among 
all  packets  of  a  flow  over  a  200  ms  time  period.  Note  the  different  scales  on  the  y-axis  of  the  two 
plots.  Next,  we  explain  in  more  detail  the  results  shown  by  these  two  plots. 
Figure 3.4: The topology used in the experiment reported in Section 3.2. Flow 1 is CBR, has an arrival rate
of 1 Mbps, and a reservation of 1 Mbps. Flow 2 is ON-OFF; it sends 3 Mbps during ON periods and doesn't
send anything during OFF periods. The flow has a reservation of 3 Mbps. Flow 3 is best-effort and has an
arrival rate of 8 Mbps. The link between aruba and cozumel is configured to 10 Mbps.

Figure 3.5: A screen-shot of our monitoring tool that displays the real-time measurement results for the
experiment shown in Figure 3.4. The top two plots show the arrival rate of each flow at aruba and cozumel;
the bottom two plots show the delay experienced by each flow at the two routers.
Consider a flow i with reservation $r_i$ that traverses a link of capacity C. Assume that the arrival
rate of the flow never exceeds its reservation and that all packets have length l. Then, it can be
shown that the worst case delay of a packet at a router is*

$$\frac{l}{r_i} + \frac{l}{C}. \tag{3.7}$$

Intuitively, the first term, $l/r_i$, represents how long it takes to transmit the packet in the ideal model
in which the flow traverses dedicated links of capacity equal to its reservation $r_i$. The second term,
$l/C$, represents the fact that in a real system all flows share the link of capacity C, and that the
packet transmission is not preemptive.

Since in our case $l = 8400$ bits (this includes the packet headers), $C = 10$ Mbps, $r_1 = 1$ Mbps,
and $r_2 = 3$ Mbps, according to Eq. (3.7), the worst case delay of flow 1 is about 9.2
ms, and the worst case delay of flow 2 is about 3.6 ms. This is confirmed by the bottom two plots.
As can be seen, especially in the bottom-right plot, the measured delays for both flows are close
to the theoretical values. The measured values are consistently close to the
worst case bounds because of the non-work-conserving nature of CJVC; even if the output link is
idle, a packet of flow i can wait for up to $l/r_i$ time in the rate-regulator before becoming eligible for
transmission (see Section 5.3). The measured delays occasionally exceed the theoretical
bounds because FreeBSD is not a real-time operating system. As a result, packet processing may
occasionally take longer because of unexpected interrupts or system calls.
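
As a quick check of the arithmetic, the bound of the form l/r_i + l/C described above can be evaluated for the two guaranteed flows. The function name `worst_case_delay` is a hypothetical helper; the inputs are the values quoted in the text.

```python
def worst_case_delay(l_bits, r_bps, c_bps):
    """Per router worst case delay: l / r_i (ideal model) + l / C (shared link)."""
    return l_bits / r_bps + l_bits / c_bps


L_BITS = 8400        # packet length, including headers
C_BPS = 10e6         # link capacity

delay_flow1_ms = worst_case_delay(L_BITS, 1e6, C_BPS) * 1e3   # 8.4 ms + 0.84 ms
delay_flow2_ms = worst_case_delay(L_BITS, 3e6, C_BPS) * 1e3   # 2.8 ms + 0.84 ms
```

The results, roughly 9.24 ms and 3.64 ms, agree with the rounded values of 9.2 ms and 3.6 ms; the l/r_i term dominates in both cases, which is consistent with the explanation based on the rate-regulator.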

Finally,  it  is  worth  noting  that  when  flow  2  is  active,  flow  3  experiences  very  large  delays  at  the 
ingress  router,  i.e.,  over  80  ms.  This  is  because  during  these  time  periods  flow  3  is  restricted  to  6 
Mbps,  while  its  arrival  rate  is  about  8  Mbps.  In  contrast,  at  the  subsequent  router,  the  packet  delay 
of  flow  3  is  much  smaller,  i.e.,  under  2  ms.  This  is  because  the  core  router  is  no  longer  congested 
after  the  ingress  has  shed  the  extra  traffic  of  flow  3.  The  reason  the  delay  experienced  by  flow  3  is 
even  lower  than  the  delays  experienced  by  the  guaranteed  flows  is  because,  unlike  these  flows,  flow 
3  is  not  regulated,  and  therefore  its  packets  are  eligible  for  transmission  as  soon  as  they  arrive. 


3.3  Comparison  to  Intserv  and  Diffserv 

To  enhance  the  best  effort  service  in  the  Internet,  over  the  past  decade  the  Internet  Engineering  Task 
Force  (IETF)  has  proposed  two  major  service  architectures:  Integrated  Services  (Intserv)  [82]  and 
Differentiated  Services  (Diffserv)  [32].  In  this  section,  we  compare  our  SCORE  architecture  to  both 
Intserv  and  Diffserv. 

3.3.1  Intserv 

As  discussed  in  Section  2.3.3,  Intserv  is  able  to  provide  powerful  and  flexible  services,  such  as 
Guaranteed  [93]  and  Controlled-Load  services  [121],  on  a  per  flow  basis.  However  this  comes  at  the 
expense  of  a  substantial  increase  in  the  complexity  as  compared  to  today’s  best-effort  architecture. 
In particular, traditional Intserv solutions require routers to perform per flow admission control and
maintain per flow state on the control path, and to perform per flow classification, scheduling, and
buffer management on the data path. This complexity is arguably the main technical reason behind
the failure to deploy Intserv in the Internet.

*This result follows from Appendix B.2, which shows that the worst case delay of our packet scheduler, called Core
Jitter Virtual Clock (CJVC), is identical to the worst case delay of Weighted Fair Queueing (WFQ) [79].

3.3.1.1  SCORE  Advantages 

Most  of  the  advantages  of  SCORE  over  Intserv  derive  from  the  fact  that  in  a  SCORE  network,  core 
routers  do  not  need  to  maintain  any  per  flow  state.  These  advantages  are: 

• Scalability: The fact that in a SCORE network routers do not need to maintain per flow state
significantly simplifies both the control and the data paths.

On  the  data  path,  routers  are  no  longer  required  to  perform  per  flow  classification,  which  is 
arguably  the  most  complex  operation  on  the  data  path  (see  Section  2.2.4).  In  addition,  as 
we  will  show  in  Chapter  5,  the  complexity  of  buffer  management  and  packet  scheduling  are 
greatly  reduced. 

On the control path, as we have briefly discussed in Section 3.1.3.2, and as we will show
in more detail in Chapter 5, by using DPS it is also possible to perform per flow admission
control in a SCORE network. Ultimately, the absence of per flow state at core routers trivially
eliminates one of the biggest challenges faced by stateful solutions in general, and Intserv in
particular: maintaining the consistency of per flow state.

In summary, the fact that core routers are not required to perform any per flow management
makes the SCORE architecture highly scalable with respect to the number of flows that
traverse a router.

• Robustness: Eliminating the need to maintain per flow state at core routers has another desirable
consequence: the SCORE architecture is more robust in the presence of link and router
failures.* This is due to the inherent difficulty of maintaining the consistency of dynamic
and replicated state in a distributed environment. As pointed out by Clark [22]: “because
of the distributed nature of the replication, algorithms to ensure robust replication are themselves
difficult to build, and few networks with distributed state information provide any sort
of protection against failure.” While soft-state mechanisms such as RSVP can alleviate this
problem, there is a fundamental trade-off between message complexity and the time period
during which the system is “allowed” to be in an inconsistent state: the shorter this period is,
the greater the signalling overhead is.


3.3.1.2  Intserv  Advantages 

While in this dissertation we show that SCORE can implement the strongest semantic service
proposed by Intserv so far, i.e., the guaranteed service, it is still unclear whether SCORE can implement
all possible per flow services that can be implemented by Intserv. To offer intuition as to what might
be difficult to implement in a SCORE network, consider a service in which a flow is allocated a
different share of the link capacity at each router along its path. In such a service a flow will receive on
each link bandwidth in proportion to its share. To implement this service in SCORE, packets would
have to carry complete information about flow shares at all routers. Unfortunately, this may
compromise the DPS algorithms’ scalability as the state will increase with the path length. In Chapter 9
we discuss in more detail the potential limitations of SCORE with respect to per flow solutions.

*In the case of a router, here we consider only fail-stop type of failures, i.e., the fact that the router (process) has failed
is detectable by other routers (processes).

In addition to the potential benefit of being able to implement more sophisticated per flow
services, Intserv has two other advantages over SCORE:

• Robustness: While the SCORE architecture is more robust in the case of fail-stop failures,
Intserv is more robust in the case of partial reservation failures. To illustrate this point,
consider a router misbehavior that inserts erroneous state in the packet headers. Since core routers
process packets based on this state, such a failure can, in the limit, compromise the service in
an entire SCORE domain. As an example, if in the network shown in Figure 3.3, router 1100
misbehaves by writing arbitrary information in the packet headers, this will affect not only
the traffic that traverses router 1100, but also the traffic that traverses router 1001! This is due
to the incorrect state carried by the packets of flow 3 that may ultimately affect the processing
of packets of other flows that traverse router 1100. In contrast, with per flow solutions such a
failure is strictly confined to the traffic that traverses the faulty router.

However,  in  Chapter  7  we  propose  an  approach  called  “verify-and-protect”  that  addresses  this 
problem. The idea is to have routers statistically verify that the incoming packets carry
consistent state. This enables routers to discover and isolate misbehaving end-hosts and routers.

• Incremental Deployability: Since all routers in a domain have to implement the same
algorithms, SCORE can be deployed only on a domain by domain basis. In contrast, Intserv
solutions can be deployed on a router by router basis. However, it should be noted that for
end-to-end services, this distinction is less important, as in the latter case (at least) all
congested routers along the path have to deploy the service.

3.3.2  Diffserv 

While  at  the  architectural  level  both  Diffserv  and  SCORE  are  similar  in  that  they  both  try  to  push 
complexity  out  of  the  network  core,  they  differ  in  two  important  aspects. 

First,  the  approach  advocated  by  the  two  architectures  to  implement  new  network  services  is 
different.  The  SCORE/DPS  approach  is  top-down.  We  start  with  a  service  and  then  derive  the 
algorithms  that  have  to  be  implemented  by  a  SCORE  network  in  order  to  achieve  the  desired  service. 
In  contrast,  Diffserv  proposes  a  bottom-up  approach.  Diffserv  standardizes  a  small  number  of  per 
hop  behaviors  (such  as  priority  service  among  a  very  small  number  of  classes)  to  be  implemented  by 
router  vendors.  It  is  then  the  responsibility  of  the  Internet  Service  Providers  (ISPs)  to  configure  their 
routers  in  order  to  achieve  the  desired  service.  Unfortunately,  configuring  these  routers  is  a  daunting 
task.  At  this  point  we  are  aware  of  no  general  framework  that  allows  us  to  build  sophisticated 
services  such  as  providing  flow  protection  by  simply  configuring  a  network  of  Diffserv  routers. 

Second,  while  in  Diffserv,  packet  headers  carry  only  limited  information  to  differentiate  among 
a  small  number  of  classes,  in  SCORE,  packets  carry  fine  grained  per  flow  state  which  allows  a 
SCORE  network  to  implement  far  more  sophisticated  services. 

Next  we  discuss  the  advantages  and  disadvantages  of  SCORE  as  compared  to  Diffserv. 
3.3.2.1 SCORE Advantages


The  advantages  of  SCORE  over  Diffserv  derive  from  the  fact  that  the  DPS  algorithms  operate  at  a 
much  finer  granularity  both  in  terms  of  time  and  traffic  aggregates:  the  state  embedded  in  a  packet 
can  be  highly  dynamic,  as  it  encodes  the  current  state  of  the  flow,  rather  than  the  static  and  global 
properties  such  as  dropping  or  scheduling  priority. 

• Service Granularity: While SCORE, like Intserv, can provide services on a per flow basis,
Diffserv provides a coarser level of service differentiation among a small number of traffic
classes. As a result, Diffserv cannot provide some useful services such as per flow bandwidth
and delay guarantees or per flow protection.

• Robustness: The extra state carried by the packets in SCORE can help to identify and isolate
malfunctions in the network. In particular, with SCORE it is possible to detect a router that
misbehaves by inserting erroneous state in the packet headers. To achieve this, in Chapter 7 we
propose an approach, called “verify-and-protect”, in which routers statistically verify whether
the incoming packets are correctly marked. For example, in the case of CSFQ, a router can
monitor a flow, estimate its rate, and then check this rate against the rate carried by the packet
headers. If the difference between the two rates falls outside a “tolerable” range, this is an indication that an up-stream
router misbehaves. Thus, the problem is confined to the routers on the path from the ingress
where the flow enters the network up to the current router.

In contrast, with Diffserv it is not possible to infer such information. If, for example, a
core  router  starts  to  drop  a  high  percentage  of  premium  packets  this  can  be  attributed  to  any 
router  along  any  path  from  the  ingress  routers  to  the  current  router. 
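
The rate check mentioned for CSFQ can be sketched as follows. This is a simplified illustration, not the verify-and-protect algorithm of Chapter 7: the function names, the averaging over a fixed window, and the 20% tolerance are all assumptions made for the example.

```python
def estimate_rate(packet_lengths_bits, window_sec):
    """Measured rate of a monitored flow over one observation window."""
    return sum(packet_lengths_bits) / window_sec


def is_consistent(measured_rate, carried_rate, tolerance=0.2):
    """True if the rate carried in the packet headers is within a relative
    tolerance of the rate measured by the router (tolerance is assumed)."""
    if carried_rate <= 0:
        return False
    return abs(measured_rate - carried_rate) <= tolerance * carried_rate


# A monitored flow delivers 130 packets of 8000 bits in one second.
measured = estimate_rate([8000] * 130, window_sec=1.0)   # 1.04 Mbps

plausible = is_consistent(measured, carried_rate=1.0e6)  # headers claim 1 Mbps
suspicious = is_consistent(measured, carried_rate=3.0e6) # headers claim 3 Mbps
```

A mismatch such as the second case would point to a misbehaving node somewhere between the ingress of the flow and the router doing the check.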

3.3.2.2  Diffserv  Advantages 

• Data Path Processing Overhead: In Diffserv, core routers process packets based on a small
number  of  traffic  classes.  Upon  packet  arrival,  a  router  classifies  the  packet,  and  then  performs 
per  class  buffer  management  and  scheduling.  Since  usually  the  number  of  classes  is  no  larger 
than  10,  packet  processing  can  be  very  efficiently  implemented.  In  contrast,  in  SCORE, 
packet processing can be more complex. For example, in the case of providing guaranteed
services, each packet has an associated deadline, and the packets are served in the increasing
order  of  their  deadlines.  However,  as  we  will  show  in  Chapter  5,  the  number  of  packets 
that  have  to  be  considered  at  one  time  is  still  much  smaller  than  the  number  of  flows.  In 
particular,  when  the  number  of  flows  is  larger  than  one  million,  the  number  of  packets  is  at 
least  two  orders  of  magnitude  smaller  than  the  number  of  flows.  We  believe  that  this  reduction 
is  enough  to  allow  packet  processing  at  the  line  speed.  Moreover,  our  other  two  solutions  to 
provide  per  flow  services,  i.e.,  flow  protection,  and  service  differentiation  of  traffic  aggregates 
over  a  large  number  of  destinations,  are  no  more  complex  than  today’s  Diffserv  solutions. 
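
Serving packets in increasing order of their deadlines can be illustrated with a priority queue. The sketch below uses a binary heap from the Python standard library purely for illustration; it is not the calendar queue used in the prototype, and the class name `DeadlineQueue` is hypothetical. Its cost grows with the number of queued packets, not with the number of flows.

```python
import heapq
import itertools


class DeadlineQueue:
    """Queue that always returns the packet with the earliest deadline."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: FIFO among equal deadlines

    def enqueue(self, deadline, packet):
        heapq.heappush(self._heap, (deadline, next(self._seq), packet))

    def dequeue(self):
        _deadline, _seq, packet = heapq.heappop(self._heap)
        return packet

    def __len__(self):
        return len(self._heap)


queue = DeadlineQueue()
queue.enqueue(3.0, "c")
queue.enqueue(1.0, "a")
queue.enqueue(2.0, "b")
queue.enqueue(1.0, "a2")                # same deadline as "a", arrived later
service_order = [queue.dequeue() for _ in range(len(queue))]
```

Each enqueue and dequeue takes time logarithmic in the number of packets currently queued, which is the quantity the text argues stays far below the number of flows.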


3.4  Summary 

In this section we have described the main components of our solution to provide per flow services in
a SCORE network architecture. To illustrate the key technique of our solution, i.e., Dynamic Packet
State (DPS), we have presented the implementation of three per flow mechanisms in a SCORE
network that (1) approximate the Fair Queueing scheduling discipline, (2) provide per flow admission
control, and (3) perform route pinning. In addition, we have compared our solution to two network
architectures proposed by the IETF to enhance the best-effort service (Intserv and Diffserv), and
conclude that our solution achieves the best of the two worlds. In particular, it can provide services as
powerful and as flexible as the ones implemented by Intserv, while having similar complexity and
scalability as Diffserv.

Chapter 4


Providing  Flow  Protection  in  SCORE 


In  this  chapter  we  present  the  first  illustration  of  our  general  solution  described  in  Chapter  3.  The 
goal  is  to  provide  flow  protection,  which  is  one  of  the  most  desirable  enhancements  of  today’s 
best-effort  service.  If  deployed,  flow  protection  would  allow  diverse  end-to-end  congestion  control 
schemes  to  seamlessly  coexist  in  the  Internet,  and  protect  well  behaved  traffic  against  malicious  or 
ill-behaved  traffic.  The  solution  of  choice  to  achieve  flow  protection  is  to  have  routers  implement  fair 
bandwidth allocation. Unfortunately, previously known implementations of fair bandwidth allocation
require  stateful  networks,  that  is,  require  routers  to  maintain  per  flow  state  and  perform  per  flow 
management.  In  this  chapter,  we  present  a  solution  to  address  this  problem.  In  particular  we  use  the 
Dynamic  Packet  State  (DPS)  technique  to  provide  fair  bandwidth  allocations  in  a  SCORE  network. 
To  the  best  of  our  knowledge  this  is  the  first  solution  to  provide  fair  bandwidth  allocation  in  a 
stateless  core  network. 

The rest of this chapter is organized as follows. The next section presents the motivation
behind flow isolation and describes the fair bandwidth allocation approach to implementing this
service. Section 4.2 outlines our solution developed in the SCORE/DPS framework, called Core-Stateless
Fair Queueing (CSFQ). Section 4.3 focuses on the details of CSFQ and its performance,
both absolute and relative, while Section 4.4 presents simulation results and compares CSFQ to
several other schemes. Finally, Section 4.5 presents related work, and Section 4.6 concludes the
chapter by summarizing our results.


4.1  Background 

Because of their reliance on statistical multiplexing, data networks such as the Internet must provide
some mechanism to control network congestion. Network congestion occurs when the rate of the
traffic arriving at a link exceeds the link capacity. The current Internet, which primarily uses FIFO
queueing and drop-tail mechanisms in its routers, relies on end-to-end congestion control in which
end-hosts reduce their transmission rates when they detect that the network is congested. The most
widely utilized form of end-to-end congestion control is the Transmission Control Protocol (TCP) [57],
which has been tremendously successful in preventing congestion collapse.

However, the effectiveness of this approach depends on one fundamental assumption: end-hosts
cooperate by implementing homogeneous congestion control algorithms. In other words, these
algorithms produce similar bandwidth allocations if used in similar circumstances. In today's Internet
this is equivalent to flows being "TCP-friendly", which means that "their arrival rate does not exceed
that of any TCP connection in the same circumstances" [36].

While  this  was  a  reasonable  assumption  in  the  past  when  the  Internet  was  primarily  used  by 
the  research  community,  and  the  vast  majority  of  traffic  was  TCP  based,  it  is  no  longer  true  today. 
In particular, this assumption can be violated in three general ways. First, some applications are
unresponsive in that they don't implement any congestion control algorithms at all. Most of the
early multimedia and multicast applications, like vat [59], nv [40], vic [70], wb [58] and RealAudio
fall into this category. Another example would be malicious users mounting denial of service
attacks  by  blasting  unresponsive  traffic  into  the  network.  Second,  some  applications  use  congestion 
control  algorithms  that,  while  responsive,  are  not  TCP-friendly.  An  example  of  such  an  algorithm  is 
Receiver-driven  Layered  Multicast  (RLM)  [69].^  Third,  some  users  will  cheat  and  use  a  non-TCP 
congestion  control  algorithm  to  get  more  bandwidth.  An  example  of  this  would  be  using  a  modified 
form  of  TCP  with,  for  instance,  a  larger  initial  window  and  window  opening  constants. 

Starting with Nagle [74], many researchers observed that these problems can be overcome when
routers have mechanisms that allocate bandwidth in a fair manner. Fair bandwidth allocation
protects well-behaved flows from the ill-behaved (unfriendly) flows, and allows a diverse set of
end-to-end congestion control policies to co-exist in the network [31]. To differentiate it from other
approaches (see Section 4.5 for an alternative approach) that deal with the unfriendly flow problem,
we call this approach the allocation approach. It is important to note that the allocation approach
does not demand that all flows adopt some universally standard end-to-end congestion control
algorithm; flows can choose to respond to the congestion in whatever manner best suits them without
harming other flows. Assuming that flows prefer not to have significant levels of packet drop, these
allocation approaches give an incentive for flows to use end-to-end congestion control, because
being unresponsive hurts their own performance.

While the allocation approach has many desirable properties for congestion control, it has yet
to be deployed in the Internet. One of the main reasons behind this state of affairs is the implementation
complexity. Until now, fair allocations were typically achieved by using per flow queueing
mechanisms - such as Fair Queueing [31, 79] and its many variants [10, 45, 94] - or per flow
dropping mechanisms such as Flow Random Early Drop (FRED) [67]. These mechanisms are significantly
more complex to implement than the traditional FIFO queueing with drop-tail, which is
the most widely implemented and deployed mechanism in routers today. In particular, fair allocation
mechanisms inherently require the router to maintain state and perform operations on a per
flow basis. For each packet that arrives at the router, the router needs to classify the packet into
a flow, update per flow state variables, and perform certain operations based on the per flow state.
The operations can be as simple as deciding whether to drop or queue the packet (e.g., FRED), or
as complex as manipulation of priority queues (e.g., Fair Queueing). While a number of techniques
have been proposed to reduce the complexity of the per packet operations [9, 94, 99], and commercial
implementations are available in some intermediate class routers, it is still unclear whether
these algorithms can be cost-effectively implemented in high-speed backbone routers because all
these algorithms still require packet classification and per flow state management.

^Although  our  data  in  Section  4.4  showed  RLM  receiving  less  than  its  fair  share,  when  we  change  the  simulation 
scenario  so  that  the  TCP  flow  starts  after  all  the  RLM  flows,  it  then  receives  less  than  half  of  its  fair  share.  This  hysteresis 
in  the  RLM  versus  TCP  behavior  was  first  pointed  out  to  us  by  Steve  McCanne  [69]. 

Figure 4.1: (a) A reference stateful network that provides fair bandwidth allocation; each node implements
the Fair Queueing algorithm. (b) A SCORE network that approximates the service provided by the reference
network; each node implements our algorithm, called Core-Stateless Fair Queueing (CSFQ).


In  this  chapter,  we  address  the  complexity  problem  by  describing  a  solution  based  on  Dynamic 
Packet  State  to  provide  fair  bandwidth  allocation  within  a  SCORE  domain.  We  call  this  solution 
Core-Stateless  Fair  Queueing  (CSFQ)  since  the  core  routers  keep  no  per  flow  state,  but  instead  use 
the state that is carried in the packet labels.


4.2  Solution  Outline 

Existing  solutions  to  provide  fair  bandwidth  allocation  require  routers  to  maintain  per  flow  state  [10, 
31,  45,  79,  94].  In  this  chapter,  we  present  the  first  solution  to  achieve  fair  bandwidth  allocation  in 
a  network  in  which  core  routers  maintain  no  per  flow  state.  Our  solution  is  based  on  the  generic 
approach  described  in  Section  3.1.2.  This  approach  consists  of  two  steps.  In  the  first  step  we  define 
a reference network that provides fair bandwidth allocation by having each node implement the Fair
Queueing algorithm (see Figure 4.1(a)). In the second step, we approximate the service provided
by the reference network within a SCORE network (see Figure 4.1(b)). To achieve this we use the
Dynamic Packet State (DPS) technique to implement a novel algorithm, called Core-Stateless Fair
Queueing (CSFQ), which approximates the behavior of Fair Queueing.

With  CSFQ,  edge  routers  use  per  flow  state  to  estimate  the  rate  of  each  incoming  flow.  Upon 
a  packet  arrival,  the  edge  router  classifies  the  packet  to  the  appropriate  flow,  updates  the  flow’s 
rate  estimate,  and  then  labels  the  packet  with  this  estimate.  In  turn,  core  routers^  implement  FIFO 
queueing  with  probabilistic  dropping  on  input.  The  probability  of  dropping  a  packet  as  it  arrives  at 
the  queue  is  a  function  of  the  rate  estimate  carried  in  the  label  and  of  the  fair  share  rate  at  that  router, 
which  is  estimated  based  on  measurements  of  the  aggregate  traffic.  When  the  packet  is  forwarded 
the  router  may  update  the  estimate  carried  by  the  packet  to  reflect  the  eventual  change  in  the  flow’s 
rate  due  to  packet  dropping.  In  this  way,  CSFQ  avoids  both  the  need  to  maintain  per  flow  state  and 
the  need  to  use  complicated  packet  scheduling  and  buffering  algorithms  at  core  routers. 

^Note that Example 1 in Section 3.1.3 outlines the CSFQ algorithm as implemented by core routers.

4.3 Core-Stateless Fair Queueing (CSFQ)


In this section we present our algorithm, called Core-Stateless Fair Queueing (CSFQ), which
approximates the behavior of Fair Queueing in a SCORE network. To offer intuition about how CSFQ
works,  we  first  present  the  idealized  bit-by-bit  or  fluid  version  of  the  probabilistic  dropping  scheme, 
and  then  extend  the  algorithm  to  a  practical  packet-by-packet  version. 

4.3.1  Fluid  Model  Algorithm 

We first consider a bufferless fluid model of a router with output link speed $C$, where the flows are
modelled as a continuous stream of bits. We assume each flow's arrival rate $r_i(t)$ is known precisely.
Max-min fair bandwidth allocations are characterized by the fact that all flows that are bottlenecked
(i.e., have bits dropped) by this router have the same output rate. We call this rate the fair share rate
of the server; let $\alpha(t)$ be the fair share rate at time $t$. In general, if max-min bandwidth allocations
are achieved, each flow $i$ receives service at a rate given by $\min(r_i(t), \alpha(t))$. Let $A(t)$ denote the
total arrival rate: $A(t) = \sum_{i=1}^{n} r_i(t)$. If $A(t) > C$ then the fair share $\alpha(t)$ is the unique solution to

$$C = \sum_{i=1}^{n} \min(r_i(t), \alpha(t)). \tag{4.1}$$

If $A(t) \le C$ then no bits are dropped and we will, by convention, set $\alpha(t) = \max_i r_i(t)$.

If $r_i(t) \le \alpha(t)$, i.e., flow $i$ sends no more than the server's fair share rate, all of its traffic will
be forwarded. If $r_i(t) > \alpha(t)$, then a fraction $(r_i(t) - \alpha(t))/r_i(t)$ of its bits will be dropped, so it will have
an output rate of exactly $\alpha(t)$. This suggests a very simple probabilistic forwarding algorithm that
achieves fair allocation of bandwidth: each incoming bit of flow $i$ is dropped with the probability

$$\max\left(0, 1 - \frac{\alpha(t)}{r_i(t)}\right). \tag{4.2}$$

When these dropping probabilities are used, the arrival rate of flow $i$ at the next hop is given by
$\min[r_i(t), \alpha(t)]$.
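
The fluid-model definitions can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions: the function names `fair_share` and `drop_probability` are hypothetical, the rates are taken at one fixed instant, and the fair share is found by a simple sort-and-fill procedure, which is one way of solving Eq. (4.1) and not something CSFQ routers themselves do.

```python
def fair_share(rates, capacity):
    """Fluid-model fair share rate for one link (solves Eq. (4.1)).

    rates: arrival rates r_i of the flows at one instant.
    capacity: output link speed C.
    """
    if sum(rates) <= capacity:
        # Uncongested link: by convention the fair share is the largest rate.
        return max(rates)
    remaining = capacity
    ordered = sorted(rates)
    n = len(ordered)
    for k, r in enumerate(ordered):
        share = remaining / (n - k)  # equal split among flows still unserved
        if r >= share:
            # This flow and all larger ones are bottlenecked at `share`.
            return share
        remaining -= r  # flow k is not bottlenecked and keeps its full rate
    return max(rates)  # not reached when the link is congested


def drop_probability(rate, alpha):
    """Per-bit dropping probability of Eq. (4.2)."""
    if rate <= 0.0:
        return 0.0
    return max(0.0, 1.0 - alpha / rate)


rates = [1.0, 2.0, 8.0]
alpha = fair_share(rates, capacity=6.0)
outputs = [r * (1.0 - drop_probability(r, alpha)) for r in rates]
```

With three flows of rates 1, 2 and 8 on a link of speed 6, the fair share works out to 3: the two small flows are untouched and the large flow is cut down to 3, so the output rates sum to the link speed.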

4.3.2  Packet  Algorithm 

The  above  algorithm  is  defined  for  a  bufferless  fluid  system  in  which  the  arrival  rates  are  known 
exactly.  Our  task  now  is  to  extend  this  approach  to  the  situation  in  real  routers  where  transmission 
is  packetized,  there  is  substantial  buffering,  and  the  arrival  rates  are  not  known. 

We still employ a drop-on-input scheme, except that now we drop packets rather than bits.
Because the rate estimation (described below) incorporates the packet size, the dropping probability
is independent of the packet size and depends only, as above, on the rate $r_i(t)$ and fair share rate
$\alpha(t)$.

We are left with two remaining challenges: estimating the rates $r_i(t)$ and the fair share $\alpha(t)$. We
address these two issues in turn in the next two subsections, and then discuss the rewriting of the
labels. Pseudocode reflecting this algorithm is given in Figures 4.3 and 4.4.

Figure 4.2: The architecture of the output port of an edge router, and a core router, respectively.


4.3.2.1 Computation of Flow Arrival Rate


Recall that in our architecture, the rates $r_i(t)$ are estimated at the edge routers and then these rates
are inserted into the packet labels. At each edge router, we use exponential averaging to estimate the
rate of a flow. Let $t_i^k$ and $l_i^k$ be the arrival time and length of the $k$-th packet of flow $i$. The estimated
rate of flow $i$, $r_i$, is updated every time a new packet is received:

$$r_i^{new} = \left(1 - e^{-T_i^k/K}\right) \frac{l_i^k}{T_i^k} + e^{-T_i^k/K} \, r_i^{old}, \tag{4.3}$$

where $T_i^k = t_i^k - t_i^{k-1}$ and $K$ is a constant. We discuss the rationale for using the form $e^{-T_i^k/K}$ for
the exponential weight in Section 4.3.7.
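
As an illustration of Eq. (4.3), the following Python sketch keeps the per flow state an edge router would need for one flow. The class name `FlowRateEstimator`, the zero initial estimate, and the handling of the first packet and of simultaneous arrivals are assumptions made here for the sake of a runnable example; the text does not specify them.

```python
import math


class FlowRateEstimator:
    """Per flow rate estimate using exponential averaging as in Eq. (4.3)."""

    def __init__(self, k):
        self.k = k                # averaging constant K, in seconds
        self.rate = 0.0           # current estimate r_i (assumed to start at 0)
        self.last_arrival = None  # arrival time of the previous packet

    def update(self, arrival_time, length):
        """Account for one packet and return the new rate estimate."""
        if self.last_arrival is None:
            # Assumption: the first packet only records its arrival time.
            self.last_arrival = arrival_time
            return self.rate
        gap = arrival_time - self.last_arrival  # T_i^k
        self.last_arrival = arrival_time
        if gap <= 0.0:
            # Assumption: for simultaneous arrivals use the limit of
            # Eq. (4.3) as T -> 0, which adds length / K to the estimate.
            self.rate += length / self.k
        else:
            w = math.exp(-gap / self.k)
            self.rate = (1.0 - w) * length / gap + w * self.rate
        return self.rate


# A flow sending 1000-byte packets every 10 ms (100000 bytes per second).
est = FlowRateEstimator(k=0.1)
for n in range(300):
    label = est.update(n * 0.01, 1000.0)
```

The value returned by `update` is what the edge router would write into the packet label.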


4.3.2.2  Link  Fair  Rate  Estimation 

In this section, we present an estimation algorithm for $\alpha(t)$. To give intuition, consider again the
fluid model in Section 4.3.1 where the arrival rates are known exactly, and assume the system performs
the probabilistic dropping algorithm according to Eq. (4.2). Then, the rate with which the
algorithm accepts packets is a function of the current estimate of the fair share rate, which we denote
by $\hat{\alpha}(t)$. Letting $F(\hat{\alpha}(t))$ denote this acceptance rate, we have

$$F(\hat{\alpha}(t)) = \sum_{i=1}^{n} \min(r_i(t), \hat{\alpha}(t)). \tag{4.4}$$

Note that $F(\cdot)$ is a continuous, nondecreasing, concave, and piecewise-linear function of $\hat{\alpha}$. If the
link is congested ($A(t) > C$) we choose $\hat{\alpha}(t)$ to be the unique solution to $F(x) = C$. If the link
is not congested ($A(t) \le C$) we take $\hat{\alpha}(t)$ to be the largest rate among the flows that traverse the
link, i.e., $\hat{\alpha}(t) = \max_{1 \le i \le n} r_i(t)$. From Eq. (4.4) note that if we knew the arrival rates $r_i(t)$ we
could then compute $\hat{\alpha}(t)$ directly. To avoid having to keep such per flow state, we seek instead to
implicitly compute $\hat{\alpha}(t)$ by using only aggregate measurements of $F$ and $A$.

We use the following heuristic algorithm with three aggregate state variables: $\hat{\alpha}$, the estimate
for the fair share rate; $\hat{A}$, the estimated aggregate arrival rate; $\hat{F}$, the estimated rate of the accepted
traffic. The last two variables are updated upon the arrival of each packet. For $\hat{A}$ we use exponential
averaging with a parameter $e^{-T/K_\alpha}$, where $T$ is the inter-arrival time between the current and the
previous packet:

$$\hat{A}_{new} = \left(1 - e^{-T/K_\alpha}\right) \frac{l}{T} + e^{-T/K_\alpha} \hat{A}_{old}, \tag{4.5}$$

where $\hat{A}_{old}$ is the value of $\hat{A}$ before the updating and $l$ is the length of the current packet. We use
an analogous formula to update $\hat{F}$.

on packet p arrival
    if (edge router)
        i = classify(p);
        p.label = estimate_rate(r_i, p);   /* use Eq. (4.3) */
    prob = max(0, 1 - α/p.label);
    if (prob > unif_rand(0, 1))
        α = estimate_α(p, 1);
        drop(p);
    else
        α = estimate_α(p, 0);
        enqueue(p);
        if (prob > 0)
            p.label = α;   /* relabel p */

Figure 4.3: The pseudocode of CSFQ.

The updating rule for $\hat{\alpha}$ depends on whether the link is congested or not. To filter out the estimation
inaccuracies due to exponential smoothing we use a window of size $K_c$. A link is assumed
to be congested if $\hat{A} > C$ at all times during an interval of length $K_c$. Conversely, a link is assumed
to be uncongested if $\hat{A} \le C$ at all times during an interval of length $K_c$. The value $\hat{\alpha}$ is updated
only at the end of an interval in which the link is either congested or uncongested according to
these definitions. If the link is congested then $\hat{\alpha}$ is updated based on the equation $F(\hat{\alpha}) = C$. We
approximate $F(\cdot)$ by a linear function that intersects the origin and has slope $\hat{F}/\hat{\alpha}_{old}$. This yields

$$\hat{\alpha}_{new} = \hat{\alpha}_{old} \frac{C}{\hat{F}}. \tag{4.6}$$

If the link is not congested, $\hat{\alpha}_{new}$ is set to the largest rate of any active flow (i.e., the largest label
seen) during the last $K_c$ time units. The value of $\hat{\alpha}_{new}$ is then used to compute dropping probabilities,
according to Eq. (4.2). For completeness, we give the pseudocode of the CSFQ algorithm in
Figure 4.4.

estimate_α(p, dropped)
    estimate_rate(A, p);   /* est. arrival rate (use Eq. (4.5)) */
    if (dropped == FALSE)
        estimate_rate(F, p);   /* est. accepted traffic rate */
    if (A > C)
        if (congested == FALSE)
            congested = TRUE;
            start_time = crt_time;
        else
            if (crt_time > start_time + K_c)
                α = α × C/F;
                start_time = crt_time;
    else   /* A ≤ C */
        if (congested == TRUE)
            congested = FALSE;
            start_time = crt_time;
            tmp_α = 0;   /* use to compute new α */
        else
            if (crt_time < start_time + K_c)
                tmp_α = max(tmp_α, p.label);
            else
                α = tmp_α;
                start_time = crt_time;
                tmp_α = 0;
    return α;

Figure 4.4: The pseudocode of CSFQ (fair rate estimation).

We now describe two minor amendments to this algorithm related to how the buffers are managed.
The goal of estimating the fair share $\hat{\alpha}$ is to match the accepted rate to the link bandwidth. Due
to estimation inaccuracies, load fluctuations between updates of $\hat{\alpha}$, and the probabilistic nature of our
algorithm, the accepted rate may occasionally exceed the link capacity. While ideally the router's
buffers can accommodate the extra packets, occasionally the router may be forced to drop the incoming
packet due to lack of buffer space. Since drop-tail behavior will defeat the purpose of our
algorithm, and may exhibit undesirable properties in the case of adaptive flows such as TCP [37], it
is important to limit its effect. To do so, we use a simple heuristic: every time the buffer overflows,
$\hat{\alpha}$ is decreased by a small fixed percentage (taken to be 1% in our simulations). Moreover, to avoid
overcorrection, we make sure that during consecutive updates $\hat{\alpha}$ does not decrease by more than
25%.

In  addition,  since  there  is  little  reason  to  consider  a  link  congested  if  the  buffer  is  almost  empty, 
we  apply  the  following  rule.  If  the  link  becomes  uncongested  by  the  test  in  Figure  4.4,  then  we 
assume  that  it  remains  uncongested  as  long  as  the  buffer  occupancy  is  less  than  some  predefined 
threshold.  In  the  current  implementation  we  use  a  threshold  that  is  half  of  the  total  buffer  capacity. 
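
The fair rate estimation of Figure 4.4 can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not taken from the text: the names `FairShareEstimator` and `exp_average` are hypothetical, the estimate starts at the link capacity, the accepted rate is averaged over gaps between accepted packets, and the current packet's label is counted when an uncongested window closes so that the estimate cannot fall to zero. The 25% cap and the buffer-occupancy rule described above are left out for brevity.

```python
import math


def exp_average(old_rate, length, gap, k):
    """One step of the e^(-T/K) exponential average used in Eq. (4.5)."""
    if gap <= 0.0:
        return old_rate + length / k  # limit of the update as T -> 0
    w = math.exp(-gap / k)
    return (1.0 - w) * length / gap + w * old_rate


class FairShareEstimator:
    """Aggregate-only estimate of the fair share rate of one output link."""

    def __init__(self, capacity, k_alpha, k_c):
        self.capacity = capacity   # link speed C
        self.k_alpha = k_alpha     # averaging constant for A and F
        self.k_c = k_c             # window length K_c
        self.alpha = capacity      # assumed starting value of the estimate
        self.arrival_rate = 0.0    # A, estimated aggregate arrival rate
        self.accepted_rate = 0.0   # F, estimated rate of accepted traffic
        self.last_arrival = None
        self.last_accept = None
        self.congested = False
        self.start_time = 0.0
        self.tmp_alpha = 0.0

    def update(self, now, length, label, dropped):
        """Process one packet arrival and return the fair share estimate."""
        if self.last_arrival is not None:
            self.arrival_rate = exp_average(
                self.arrival_rate, length, now - self.last_arrival, self.k_alpha)
        self.last_arrival = now
        if not dropped:
            if self.last_accept is not None:
                self.accepted_rate = exp_average(
                    self.accepted_rate, length, now - self.last_accept, self.k_alpha)
            self.last_accept = now

        if self.arrival_rate > self.capacity:
            if not self.congested:
                self.congested = True
                self.start_time = now
            elif now > self.start_time + self.k_c:
                if self.accepted_rate > 0.0:
                    self.alpha *= self.capacity / self.accepted_rate  # Eq. (4.6)
                self.start_time = now
        else:
            if self.congested:
                self.congested = False
                self.start_time = now
                self.tmp_alpha = 0.0
            elif now < self.start_time + self.k_c:
                self.tmp_alpha = max(self.tmp_alpha, label)
            else:
                # Deviation from Figure 4.4: also count the current label.
                self.alpha = max(self.tmp_alpha, label)
                self.start_time = now
                self.tmp_alpha = 0.0
        return self.alpha

    def on_buffer_overflow(self):
        """Buffer overflow heuristic: shrink the estimate by 1%."""
        self.alpha *= 0.99
```

A caller would invoke `update` once per arriving packet, after the dropping decision has been made with the previous estimate, and `on_buffer_overflow` whenever a packet is lost for lack of buffer space.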

4.3.2.3 Label Rewriting

Our rate estimation algorithm in Section 4.3.2.1 allows us to label packets with their flow's rate as
they enter the SCORE domain. Our packet dropping algorithm described in Section 4.3.2.2 allows
us to limit flows to their fair share of the bandwidth. After a flow experiences significant losses at a
congested link inside the domain, however, the packet labels are no longer an accurate estimate of
its rate. We cannot rerun our estimation algorithm, because it involves per flow state. Fortunately,
as noted in Section 4.3.1, the outgoing rate is merely the incoming rate or the fair rate, $\alpha$, whichever
is smaller. Therefore, we rewrite the packet label $L$ as

$$L_{new} = \min(L_{old}, \alpha). \tag{4.7}$$

By doing so, the outgoing flow rates will be properly represented by the packet labels.
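
Taken together, the dropping rule of Eq. (4.2) and the relabeling rule of Eq. (4.7) give the per packet work of a core router. The Python sketch below illustrates this; the function name `core_forward` and the injectable `rng` parameter are hypothetical, and the fair share estimate `alpha` is simply passed in as a number.

```python
import random


def core_forward(label, alpha, rng=random.random):
    """Decide the fate of one packet at a core router.

    label: flow rate estimate carried by the packet.
    alpha: the router's current fair share estimate.
    Returns (forwarded, new_label).
    """
    prob = max(0.0, 1.0 - alpha / label) if label > 0.0 else 0.0
    if prob > rng():
        return False, label          # packet dropped on input
    return True, min(label, alpha)   # Eq. (4.7): relabel on forwarding


# A flow labeled 8 crossing a link whose fair share is 3 should keep
# about 3/8 of its packets, each relabeled to 3.
source = random.Random(1)
results = [core_forward(8.0, 3.0, rng=source.random) for _ in range(20000)]
kept = sum(1 for forwarded, _ in results if forwarded) / len(results)
```

Note that no per flow state appears anywhere in this function: everything it needs is either in the packet or is an aggregate quantity of the link.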

4.3.3  Weighted  CSFQ 

The CSFQ algorithm can be extended to support flows with different weights. Let $w_i$ denote the
weight of flow $i$. Returning to our fluid model, the meaning of these weights is that we say a fair
allocation is one in which all bottlenecked flows have the same value for $r_i/w_i$. Then, if $A(t) > C$, the
normalized fair rate $\alpha(t)$ is the unique value such that $\sum_{i=1}^{n} w_i \min(\alpha, r_i/w_i) = C$. The expression
for the dropping probabilities in the weighted case is $\max(0, 1 - \alpha w_i/r_i)$. The only other major
change is that the label is now $r_i/w_i$, instead of simply $r_i$. Finally, without going into detail, we note
that the weighted packet-by-packet version is virtually identical to the corresponding version of the
plain CSFQ algorithm.

It is also important to note that with weighted CSFQ we can only approximate a reference network
in which each flow has the same weight at all routers along its path. That is, our algorithm cannot
accommodate situations where the relative weights of flows differ from router to router within
a domain. However, even with this limitation, weighted CSFQ may prove a valuable mechanism in
implementing differential services, such as the one proposed in [116].
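
A small numeric sketch may help with the weighted definitions. The Python below computes the normalized fair rate for a set of weighted flows in the fluid model; the function names are hypothetical, and the value returned for an uncongested link (the largest normalized rate) is an assumption made by analogy with the unweighted convention, since the text only defines the congested case.

```python
def weighted_fair_rate(rates, weights, capacity):
    """Normalized fair rate: solves sum_i w_i * min(alpha, r_i / w_i) = C."""
    if sum(rates) <= capacity:
        # Assumption: mirror the unweighted convention when uncongested.
        return max(r / w for r, w in zip(rates, weights))
    flows = sorted(zip(rates, weights), key=lambda f: f[0] / f[1])
    remaining = capacity
    weight_left = sum(weights)
    for r, w in flows:
        alpha = remaining / weight_left
        if r / w >= alpha:
            return alpha       # this flow and all later ones are bottlenecked
        remaining -= r         # flow is not bottlenecked; it keeps its rate
        weight_left -= w
    return max(r / w for r, w in zip(rates, weights))  # not reached if congested


def weighted_drop_probability(rate, weight, alpha):
    """Dropping probability in the weighted case."""
    if rate <= 0.0:
        return 0.0
    return max(0.0, 1.0 - alpha * weight / rate)


# Two flows both sending at rate 6 over a link of speed 6, weights 1 and 2.
alpha = weighted_fair_rate([6.0, 6.0], [1.0, 2.0], 6.0)
out_1 = 6.0 * (1.0 - weighted_drop_probability(6.0, 1.0, alpha))
out_2 = 6.0 * (1.0 - weighted_drop_probability(6.0, 2.0, alpha))
```

Here the normalized fair rate is 2, so the flow with weight 1 is limited to 2 and the flow with weight 2 to 4, in proportion to their weights.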

4.3.4  Performance  Bounds 

We  now  present  the  main  theoretical  result  for  CSFQ.  For  generality,  this  result  is  given  for  weighted 
CSFQ.  The  proof  is  given  in  Appendix  A. 

Our algorithm is built around several estimation procedures, and thus is inherently inexact. One
natural concern is whether a flow can purposely "exploit" these inaccuracies to get more than its
fair share of bandwidth. We cannot answer this question with full generality, but we can analyze a
simplified situation where the normalized fair share rate, $\alpha$, is held fixed and there is no buffering,
so the drop probabilities are precisely given by Eq. (4.2). In addition, we assume that when a packet
arrives, a fraction of that packet equal to the flow's forwarding probability is transmitted. Note that
during any time interval $[t_1, t_2)$ a flow with weight $w$ is entitled to receive at most $w\alpha(t_2 - t_1)$
service time; we call any amount above this the excess service. This excess service can be bounded,
independent of both the arrival process and the length of the time interval during which the flow is
active. The bound does depend crucially on the maximal rate, $R$, at which a flow's packets can arrive
at a router (limited, for example, by the speed of the flow's access link); the smaller this rate $R$, the
tighter the bound.

Theorem 1 Consider a link with a constant normalized fair rate $\alpha$, and a flow with weight $w$. Then,
the excess service received by a flow with weight $w$ that sends at a rate no larger than $R$ is bounded
above by

$$r_\alpha K \left(1 + \ln \frac{R}{r_\alpha}\right) + l_{max}, \tag{4.8}$$

where $r_\alpha = \alpha w$, and $l_{max}$ represents the maximum length of a packet.

By  bounding  the  excess  service,  we  have  shown  that  in  this  idealized  setting,  the  asymptotic 
throughput  cannot  exceed  the  fair  share  rate.  Thus,  flows  can  only  exploit  the  system  over  short 
time  scales;  they  are  limited  to  their  fair  share  over  long  time  scales. 

4.3.5  Implementation  Complexity 

At  core  routers,  both  the  time  and  space  complexity  of  our  algorithm  are  constant  with  respect  to  the 
number  of  competing  flows,  and  thus  we  think  CSFQ  could  be  implemented  in  very  high  speed  core 
routers.  At  each  edge  router  CSFQ  needs  to  maintain  per  flow  state.  Upon  the  arrival  of  each  packet, 
the  edge  router  needs  to  (1)  classify  the  packet  to  a  flow,  (2)  update  the  fair  share  rate  estimation  for 
the  corresponding  outgoing  link,  (3)  update  the  flow  rate  estimation,  and  (4)  label  the  packet.  All 
these  operations  with  the  exception  of  packet  classification  can  be  efficiently  implemented  today. 

Efficient  and  general-purpose  packet  classification  algorithms  are  still  under  active  research.  We 
expect  to  leverage  these  results.  We  also  note  that  packet  classification  at  ingress  nodes  is  needed  for 
a  number  of  other  purposes,  such  as  in  the  context  of  Multiprotocol  Label  Switching  (MPLS)  [17] 
or  for  accounting  purposes.  Therefore,  the  classification  required  for  CSFQ  may  not  be  an  extra 
cost. In addition, edge routers are typically not on the high-speed backbone links, so they pose no
problem, as classification at moderate speeds is quite practical.

4.3.6 Architectural Considerations

We  have  used  the  term  flow  without  defining  what  we  mean.  This  was  intentional,  as  the  CSFQ 
approach  can  be  applied  to  varying  degrees  of  flow  granularity;  that  is,  what  constitutes  a  flow  is 
arbitrary  as  long  as  all  packets  in  the  flow  follow  the  same  path  within  the  core.  For  convenience, 
flow  as  used  here  is  implicitly  defined  as  a  source-destination  pair,  but  one  could  easily  assign  fair 
rates  to  many  other  granularities  such  as  source-destination-ports.  Moreover,  the  unit  of  “flow”  can 
vary  from  domain  to  domain  as  long  as  the  rates  are  re-estimated  when  entering  a  new  domain. 

Similarly,  we  have  not  been  precise  about  the  size  of  the  SCORE  domains.  In  one  extreme,  we 
could  take  each  router  as  a  domain  and  estimate  rates  at  every  router;  this  would  allow  us  to  avoid 
the  use  of  complicated  per  flow  scheduling  and  dropping  algorithms,  but  would  require  per  flow 
classification.  Another  possibility  is  that  ISPs  could  extend  their  SCORE  domain  to  the  very  edge 
of  their  network,  having  their  edge  routers  at  the  points  where  customer’s  packets  enter  the  ISP’s 
network.  Building  on  the  previous  scenario,  multiple  ISPs  could  combine  their  domains  so  that 
classification  and  estimation  did  not  have  to  be  performed  at  ISP-ISP  boundaries.  The  key  obstacle 
here  is  one  of  trust  between  ISPs. 

4.3.7  Miscellaneous  Details 

Having  presented  the  basic  CSFQ  algorithm,  we  now  return  to  discuss  a  few  aspects  in  more  detail. 

We have used exponential averaging to estimate the arrival rate in Eq. (4.3). However, instead of
using a constant exponential weight we used $e^{-T/K}$, where $T$ is the inter-packet arrival time and $K$
is a constant. Our motivation was that $e^{-T/K}$ more closely reflects a fluid averaging process which
is independent of the packetizing structure. More specifically, it can be shown that if a constant
weight is used, the estimated rate will be sensitive to the packet length distribution and there are
pathological cases where the estimated rate differs from the real arrival rate by a factor; this would
allow flows to exploit the estimation process and obtain more than their fair share. In contrast, by
using a parameter of $e^{-T/K}$, the estimated rate will asymptotically converge to the real rate, and
this allows us to bound the excess service that can be achieved (as in Theorem 1). We used a similar
averaging process in Eq. (4.5) to estimate the total arrival rate $A$.
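
The sensitivity of a constant weight to the packetizing structure can be seen in a small experiment. The Python sketch below is only an illustration: the traffic pattern, the constant weight 0.9, and the value K = 0.5 s are all hypothetical choices made for this example, not values taken from the simulations.

```python
import math


def estimate(packets, weight_for_gap):
    """Run the exponential-averaging update over (gap, length) pairs."""
    rate = 0.0
    for gap, length in packets:
        w = weight_for_gap(gap)
        rate = (1.0 - w) * length / gap + w * rate
    return rate


K = 0.5  # averaging constant in seconds (illustrative)

# Hypothetical flow: a 1000-byte packet arriving 1 ms after the previous
# packet, then a 100-byte packet 100 ms later, repeated many times.
pattern = [(0.001, 1000.0), (0.1, 100.0)] * 500
real_rate = (1000.0 + 100.0) / (0.001 + 0.1)  # bytes per second

constant_weight = estimate(pattern, lambda gap: 0.9)
time_weight = estimate(pattern, lambda gap: math.exp(-gap / K))
```

For this pattern the constant-weight estimate settles more than an order of magnitude above the real rate, because every packet counts equally no matter how much time it represents, whereas the estimate using the weight e^(-T/K) stays within roughly 10% of the real rate.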

The choice of $K$ in the above expression $e^{-T/K}$ presents us with several tradeoffs. First, while
a smaller $K$ increases system responsiveness to rapid rate fluctuations, a larger $K$ better filters noise
and avoids potential system instability. Second, $K$ should be large enough such that the estimated
rate, calculated at the edge of the network, remains reasonably accurate after a packet traverses
multiple links. This is because the delay-jitter changes the packets' inter-arrival pattern, which may
result in an increased discrepancy between the estimated rate (received in the packets' labels) and
the real rate. To counteract this effect, as a rule of thumb, $K$ should be one order of magnitude
larger than the delay-jitter experienced by a flow over a time interval of the same size, $K$. Third, $K$
should be no larger than the average duration of a flow. Based on these constraints, an appropriate
value for $K$ would be between 100 and 500 ms.


4.4  Simulation  Results 

In this section we evaluate our algorithm by simulation. To provide some context, we compare
CSFQ's performance to four additional algorithms. Two of these, FIFO and RED, represent baseline
cases where routers do not attempt to achieve fair bandwidth allocations. The other two algorithms,
FRED and DRR, represent different approaches to achieving fairness.

•  FIFO  (First  In  First  Out)  -  Packets  are  served  in  a  first-in  first-out  order,  and  the  buffers  are 
managed  using  a  simple  drop-tail  strategy;  i.e.,  incoming  packets  are  dropped  when  the  buffer 
is  full. 

•  RED  (Random  Early  Detection)  -  Packets  are  served  in  a  first-in  first-out  order,  but  the  buffer 
management is significantly more sophisticated than drop-tail. RED [37] starts to probabilistically
drop packets long before the buffer is full, providing early congestion indication to
flows which can then gracefully back off before the buffer overflows. RED maintains two
buffer  thresholds.  When  the  exponentially  averaged  buffer  occupancy  is  smaller  than  the  first 
threshold,  no  packet  is  dropped,  and  when  the  exponentially  averaged  buffer  occupancy  is 
larger  than  the  second  threshold  all  packets  are  dropped.  When  the  exponentially  averaged 
buffer  occupancy  is  between  the  two  thresholds,  the  packet  dropping  probability  increases 
linearly  with  buffer  occupancy. 

•  FRED  (Fair  Random  Early  Drop)  -  This  algorithm  extends  RED  to  provide  some  degree  of 
fair  bandwidth  allocation  [67].  To  achieve  fairness,  FRED  maintains  state  for  all  flows  that 
have  at  least  one  packet  in  the  buffer.  Unlike  RED  where  the  dropping  decision  is  based  only 
on  the  buffer  state,  in  FRED  dropping  decisions  are  based  on  this  flow  state.  Specifically, 
FRED preferentially drops a packet of a flow that has either (1) had many packets dropped
in the past, or (2) a queue larger than the average queue size. FRED has two variants (which
we will call FRED-1 and FRED-2). The main difference between the two is that FRED-2
guarantees to each flow a minimum number of buffers. As a general rule, FRED-2 performs
better than FRED-1 only when the number of flows is large. In the following data, when we
do not distinguish between the two, we are quoting the results from the version of FRED that
performed best.

•  DRR  (Deficit  Round  Robin)  -  This  algorithm  represents  an  efficient  implementation  of  the 
well-known weighted fair queueing (WFQ) discipline. The buffer management scheme assumes
that when the buffer is full the packet from the longest queue is dropped. DRR is the
only  one  of  the  four  to  use  a  sophisticated  per-flow  queueing  algorithm,  and  thus  achieves  the 
highest  degree  of  fairness. 
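
As an illustration of the RED behavior described in the list above, the following sketch computes the drop probability from the exponentially averaged buffer occupancy. It is a simplified reading of the description rather than the ns-2 implementation: the `max_p` ceiling and the averaging weight are assumed parameters with illustrative values, and the 16 KB and 32 KB thresholds are the ones used for RED in the simulations of this section.

```python
def ewma(avg, sample, weight=0.002):
    """Exponentially averaged buffer occupancy (the weight is an assumed value)."""
    return (1.0 - weight) * avg + weight * sample

def red_drop_probability(avg_queue, min_th, max_th, max_p=0.1):
    """Drop probability as a function of the averaged buffer occupancy.

    Below min_th nothing is dropped, at or above max_th everything is dropped,
    and in between the probability grows linearly up to max_p (assumed ceiling).
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

KB = 1024
for occupancy_kb in (8, 16, 24, 31, 32, 48):
    p = red_drop_probability(occupancy_kb * KB, 16 * KB, 32 * KB)
    print(f"average occupancy {occupancy_kb:2d} KB -> drop probability {p:.3f}")
```

Because the decision depends only on the aggregate buffer occupancy, this kind of router needs no flow classification, which is why FIFO and RED serve as the baseline cases.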

These  four  algorithms  represent  four  different  levels  of  complexity.  DRR  and  FRED  have  to 
classify  incoming  flows,  whereas  FIFO  and  RED  do  not.  In  addition,  DRR  has  to  implement  its 
packet  scheduling  algorithm  (whereas  the  rest  all  use  first-in-first-out  scheduling).  CSFQ  edge 
routers  have  complexity  comparable  to  FRED,  and  CSFQ  core  routers  have  complexity  comparable 
to  RED. 

We  have  examined  the  behavior  of  CSFQ  under  a  variety  of  conditions.  We  use  an  assortment 
of  traffic  sources  (mainly  CBR  and  TCP,  but  also  some  on-off  sources)  and  topologies.  For  space 
reasons, we only report on a small sampling of the simulations we have run; a fuller set of tests, and
the scripts used to run them, is available at http://www.cs.cmu.edu/~istoica/csfq. All
simulations  were  performed  in  ns-2  [78],  which  provides  accurate  packet-level  implementation  for 
various  network  protocols,  such  as  TCP  and  RLM  (Receiver-driven  Layered  Multicast)  [69],  and 
various  buffer  management  and  scheduling  algorithms,  such  as  RED  and  DRR. 

Unless  otherwise  specified,  we  use  the  following  parameters  for  the  simulations  in  this  section. 
Each  output  link  has  a  latency  of  1  ms,  a  buffer  of  64  KB,  and  a  buffer  threshold  for  CSFQ  of  16 
KB.  In  the  RED  and  FRED  cases,  the  first  threshold  is  set  to  16  KB,  while  the  second  one  is  set  to 
32 KB. The averaging constant used in estimating the flow rate is K = 100 ms, while the averaging
constant used in estimating the fair rate $\alpha$ is $K_\alpha$ = 200 ms. Finally, in all topologies the first
router (gateway) on the path of a flow is always assumed to be the edge router; all other routers are
assumed without exception to be core routers.

We simulated the other four algorithms to give us benchmarks against which to assess these
results. We use DRR as our model of fairness and use the baseline cases, FIFO and RED, as representing
the (unfair) status quo. The goal of these experiments is to determine where CSFQ sits between
these two extremes. FRED is a more ambiguous benchmark, being somewhat more complex than
CSFQ, but not as complex as DRR.

In  general,  we  find  that  CSFQ  achieves  a  reasonable  degree  of  fairness,  significantly  closer 
to  DRR  than  to  FIFO  or  RED.  CSFQ’s  performance  is  typically  comparable  to  FRED’s,  although 
there  are  a  few  situations  where  CSFQ  significantly  outperforms  FRED.  There  are  a  large  number  of 
experiments  and  each  experiment  involves  rather  complex  dynamics.  Due  to  space  limitations,  in  the 
sections  that  follow  we  will  merely  highlight  a  few  important  points  and  omit  detailed  explanations 
of the dynamics.

Figure 4.5: (a) A 10 Mbps link shared by N flows. (b) The average throughput over 10 sec when N = 32, and
all flows are CBRs. The arrival rate for flow i is (i + 1) times larger than its fair share. The flows are indexed
from 0.


4.4.1 A Single Congested Link

We  first  consider  a  single  congested  link  shared  by  N  flows  (see  Figure  4.5(a)).  We  performed  three 
related  experiments. 

In the first experiment, we have 32 CBR flows, indexed from 0, where flow i sends (i + 1) times
its fair share of 0.3125 Mbps. Thus flow 0 sends 0.3125 Mbps, flow 1 sends 0.625 Mbps,
and so on. Figure 4.5(b) shows the average throughput of each flow over a 10 sec interval; FIFO,
RED, and FRED-1 fail to ensure fairness, with each flow getting a share proportional to its incoming
rate, while DRR is extremely effective in achieving a fair bandwidth distribution. CSFQ and FRED-2
achieve a less precise degree of fairness; for CSFQ the throughputs of all flows are between -11%
and +12% of the ideal value.
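
The ideal allocation against which these results are judged is the max-min fair allocation. The sketch below is a generic water-filling computation, not code from the simulation scripts; the function name is an assumption. It reproduces the 0.3125 Mbps fair share for the 32 flows of this experiment and the band implied by the -11% and +12% deviations reported for CSFQ.

```python
def max_min_fair(demands, capacity):
    """Water-filling: satisfy the smallest demands first, split the rest equally."""
    alloc = [0.0] * len(demands)
    remaining = capacity
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        share = remaining / len(active)
        smallest = active[0]
        if demands[smallest] <= share:
            # This flow asks for no more than an equal split, so it is fully served.
            alloc[smallest] = demands[smallest]
            remaining -= demands[smallest]
            active.pop(0)
        else:
            # Every remaining flow wants more than the equal split.
            for j in active:
                alloc[j] = share
            break
    return alloc

capacity_mbps = 10.0
fair_share = capacity_mbps / 32
demands = [(i + 1) * fair_share for i in range(32)]   # flow i sends (i + 1) x fair share
ideal = max_min_fair(demands, capacity_mbps)

print(f"ideal rate per flow: {ideal[0]:.4f} Mbps")
print(f"CSFQ band: {0.89 * fair_share:.3f} to {1.12 * fair_share:.3f} Mbps")
```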

In  the  second  experiment  we  consider  the  impact  of  an  ill-behaved  CBR  flow  on  a  set  of  TCP 
flows.  More  precisely,  the  traffic  of  flow  0  comes  from  a  CBR  source  that  sends  at  10  Mbps,  while  all 
the other flows (from 1 to 31) are TCPs. Figure 4.6(a) shows the throughput of each flow averaged
over a 10 sec interval. The only two algorithms that effectively contain the CBR flow
are DRR and CSFQ. Under FRED the CBR flow gets almost 1.8 Mbps - close to six times
its fair share - while the CBR only gets 0.396 Mbps and 0.355 Mbps under DRR and CSFQ,
respectively. As expected, FIFO and RED perform poorly, with the CBR flow getting over 8 Mbps
in both cases.

In the final experiment, we measure how well the algorithms can protect a single TCP flow
against multiple ill-behaved flows. We perform 31 simulations, each for a different value of N,
N = 1, ..., 31. In each simulation we take one TCP flow and N CBR flows; each CBR sends at
twice its fair share rate of $\frac{10}{N+1}$ Mbps. Figure 4.6(b) plots the ratio between the average throughput
of the TCP flow over 10 sec and the total bandwidth it should receive as a function of the total
number of flows in the system, N + 1. There are three points of interest. First, DRR performs
very well when there are fewer than 22 flows, but its performance decreases afterwards because
then the TCP flow's buffer share is less than three buffers, which is known to significantly affect its
throughput. Second, CSFQ performs better than DRR when the number of flows is large. This is
because CSFQ is able to cope better with the TCP burstiness by allowing the TCP flow to have more
than two packets buffered for short time intervals. Finally, across the entire range, CSFQ provides
similar or better performance as compared to FRED.

Figure 4.6: (a) The throughputs of one CBR flow (0 indexed) sending at 10 Mbps, and of 31 TCP flows
sharing a 10 Mbps link. (b) The normalized bandwidth of a TCP flow that competes with N CBR flows
sending at twice their allocated rates, as a function of N.

4.4.2  Multiple  Congested  Links 

We  now  analyze  how  the  throughput  of  a  well-behaved  flow  is  affected  when  the  flow  traverses  more 
than  one  congested  link.  We  performed  two  experiments  based  on  the  topology  shown  in  Figure  4.7. 
All  CBRs,  except  CBR-0,  send  at  2  Mbps.  Since  each  link  in  the  system  has  10  Mbps  capacity,  this 
will  result  in  all  links  between  routers  being  congested. 

In  the  first  experiment,  we  have  a  CBR  flow  (denoted  CBR-0)  sending  at  its  fair  share  rate 
of 0.909 Mbps. Figure 4.8(a) shows the fraction of CBR-0's traffic that is forwarded, versus the
number  of  congested  links.  CSFQ  and  FRED  perform  reasonably  well,  although  not  quite  as  well 
as  DRR. 

In the second experiment we replace CBR-0 with a TCP flow. Similarly, Figure 4.8(b) plots the
normalized TCP throughput against the number of congested links. Again, DRR and CSFQ prove
to be effective. In comparison, FRED performs significantly worse though still much better than
RED and FIFO. The reason is that while DRR and CSFQ try to allocate bandwidth fairly among
competing flows during congestion, FRED tries to allocate the buffer fairly. Flows with different
end-to-end congestion control algorithms will achieve different throughputs even if routers try to
fairly allocate the buffer. In the case of Figure 4.8(a), all sources are CBR, i.e., none adopts
any end-to-end congestion control algorithm, and FRED provides performance similar to CSFQ and
DRR. In the case of Figure 4.8(b), a TCP flow is competing with multiple CBR flows. Since the TCP
flow slows down during congestion while a CBR flow does not, it achieves significantly less throughput
than a competing CBR flow.

Figure 4.7: Topology for analyzing the effects of multiple congested links on the throughput of a flow. Each
link has ten cross flows (all CBRs). All links have 10 Mbps capacities. The sending rates of all CBRs,
except CBR-0, are 2 Mbps, which leads to all links between routers being congested.

Figure 4.8: (a) The normalized throughput of CBR-0 as a function of the number of congested links. (b)
The same plot when CBR-0 is replaced by a TCP flow.

4.4.3 Coexistence of Different Adaptation Schemes

In  this  experiment  we  investigate  the  extent  to  which  CSFQ  can  deal  with  flows  that  employ  different 
adaptation  schemes.  Receiver-driven  Layered  Multicast  (RLM)  [69]  is  an  adaptive  scheme  in  which 
the  source  sends  the  information  encoded  into  a  number  of  layers  (each  to  its  own  multicast  group) 
and  the  receiver  joins  or  leaves  the  groups  associated  with  the  layers  based  on  how  many  packet 
drops  it  is  experiencing.  We  consider  a  4  Mbps  link  traversed  by  one  TCP  and  three  RLM  flows. 
Each source uses a seven layer encoding, where layer i sends $2^{i+4}$ Kbps; each layer is modeled by
a CBR traffic source. The fair share of each flow is 1 Mbps. In the RLM case this will correspond to
each receiver subscribing to the first five layers.*

The average receiving rates averaged over 1 sec intervals for each algorithm are plotted in Figure 4.9.
We have conducted two separate simulations of CSFQ.** In the first one, we have used the
same averaging constants as in the rest of this chapter: K = 100 ms and $K_\alpha$ = 200 ms. Here, one
RLM flow does not get its fair share (it is one layer below where it should be). We think this is due
to the bursty behavior of the TCP that is not detected by CSFQ soon enough, allowing the TCP to
opportunistically grab more bandwidth than its share at the expense of less aggressive RLM flows.
To test this hypothesis, we have changed the averaging time intervals to K = 20 ms and $K_\alpha$ = 40
ms, respectively, which results in the TCP flow's bandwidth being restricted much earlier. As shown in
Figure 4.9(d), with these parameters all flows receive roughly 1 Mbps.

* More precisely, we have $\sum_{i=1}^{5} 2^{i+4}$ Kbps = 0.992 Mbps.
** See also [69] for additional simulations of RLM and TCP.
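
The relation between the RLM layering and the 1 Mbps fair share can be checked with a few lines of code. The sketch assumes that layer i (for i = 1, ..., 7) sends 2^(i+4) Kbps, which is the reading consistent with the 0.992 Mbps total quoted in the footnote; the function names are illustrative.

```python
def layer_rate_kbps(i):
    """Rate of RLM layer i, assuming layer i sends 2^(i+4) Kbps."""
    return 2 ** (i + 4)

def layers_within(fair_share_kbps, num_layers=7):
    """Number of layers a receiver can subscribe to without exceeding its share."""
    total, count = 0, 0
    for i in range(1, num_layers + 1):
        if total + layer_rate_kbps(i) > fair_share_kbps:
            break
        total += layer_rate_kbps(i)
        count += 1
    return count, total

count, total = layers_within(1000)          # fair share of 1 Mbps
print(count, total)                         # prints "5 992"
```

A receiver that is one layer below its fair share, as happens to one RLM flow with K = 100 ms, would therefore receive 480 Kbps instead of 992 Kbps under this layering.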

An  interesting  point  to  notice  is  that  FRED  does  not  provide  fair  bandwidth  allocation  in  this 
scenario.  Again,  as  discussed  in  Section  4.4.2,  this  is  due  to  the  fact  that  RLM  and  TCP  use  different 
end-to-end  congestion  control  algorithms. 

Finally,  we  note  that  we  have  performed  two  other  similar  experiments  (not  included  here  due 
to  space  limitations):  one  in  which  the  TCP  flow  is  replaced  by  a  CBR  that  sends  at  4  Mbps,  and 
another  in  which  we  have  both  the  TCP  and  the  CBR  flows  together  along  with  the  three  RLM  flows. 
The  overall  results  were  similar,  except  that  in  both  experiments  all  flows  received  their  shares  under 
CSFQ when using the original settings for the averaging intervals, i.e., K = 100 ms and $K_\alpha$ = 200
ms. In addition, in some of these other experiments where the RLM flows are started before the
TCP,  the  RLM  flows  get  more  than  their  share  of  bandwidth  when  RED  and  FIFO  are  used. 

4.4.4 Different Traffic Models

So  far  we  have  only  considered  CBR  and  TCP  traffic  sources.  We  now  look  at  two  additional 
source  models  with  greater  degrees  of  burstiness.  We  again  consider  a  single  10  Mbps  congested 
link.  In  the  first  experiment,  this  link  is  shared  by  one  ON-OFF  source  and  19  CBRs  that  send  at 
exactly  their  share,  0.5  Mbps.  The  ON  and  OFF  periods  of  the  ON-OFF  source  are  both  drawn  from 
exponential distributions with means of 200 ms and 19 × 200 ms, respectively. During the ON period
the ON-OFF source sends at 10 Mbps. Note that the ON-time is on the same order as the averaging
interval K = 200 ms for CSFQ's rate estimation algorithm, so this experiment is designed to test to
what extent CSFQ can react over short time scales.
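
With these parameters the ON-OFF source is active one twentieth of the time on average, so its long-run rate equals the 0.5 Mbps fair share even though it sends at the full link rate while ON. The following sketch checks this both analytically and with a small Monte Carlo run; the function names, the seed, and the number of cycles are illustrative assumptions.

```python
import random

def on_off_average_rate(peak_mbps, mean_on_s, mean_off_s):
    """Long-run average rate of an ON-OFF source that sends at peak_mbps while ON."""
    return peak_mbps * mean_on_s / (mean_on_s + mean_off_s)

def simulated_on_fraction(mean_on_s, mean_off_s, cycles=20000, seed=0):
    """Fraction of time spent ON when both periods are exponentially distributed."""
    rng = random.Random(seed)
    on_time = off_time = 0.0
    for _ in range(cycles):
        on_time += rng.expovariate(1.0 / mean_on_s)
        off_time += rng.expovariate(1.0 / mean_off_s)
    return on_time / (on_time + off_time)

mean_on, mean_off = 0.2, 19 * 0.2
print(f"average rate: {on_off_average_rate(10.0, mean_on, mean_off):.3f} Mbps")
print(f"simulated ON fraction: {simulated_on_fraction(mean_on, mean_off):.4f}")
```

An algorithm that enforces the fair share at all times, rather than on average, will therefore drop most of the packets sent during an ON burst.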


Algorithm    delivered    dropped
DRR          1080         3819
CSFQ         1000         3889
FRED         1064         3825
RED          2819         2080
FIFO         3771         1128

Table 4.1: Statistics for an ON-OFF flow with 19 competing CBR flows (all numbers are in packets).

The ON-OFF source sent 4899 packets over the course of the experiment. Table 4.1 shows the
number  of  packets  from  the  ON-OFF  source  dropped  at  the  congested  link.  The  DRR  results  show 
what  happens  when  the  ON-OFF  source  is  restricted  to  its  fair  share  at  all  times.  FRED  and  CSFQ 
also  are  able  to  achieve  a  high  degree  of  fairness. 

Our next experiment simulates Web traffic. There are 60 TCP transfers whose inter-arrival times
are exponentially distributed with a mean of 0.1 ms, and the length of each transfer is drawn from
a Pareto distribution with a mean of 40 packets (1 packet = 1 KB) and a shape parameter of 1.06.
These values are consistent with those presented by Crovella and Bestavros [27]. In addition, there
is a single 10 Mbps CBR flow.
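
For readers who want to reproduce a workload of this kind, the sketch below shows one way to draw transfer lengths from a Pareto distribution with the stated mean and shape. It relies on the standard relation mean = shape × scale / (shape - 1) for shape > 1; the function names, the rounding to whole packets, and the seed are assumptions and do not describe how the original ns-2 scripts generate the traffic.

```python
import random

def pareto_scale(mean, shape):
    """Scale (minimum value) of a Pareto distribution with the given mean and shape."""
    if shape <= 1.0:
        raise ValueError("the mean is finite only for shape > 1")
    return mean * (shape - 1.0) / shape

def sample_transfer_lengths(n, mean=40.0, shape=1.06, seed=1):
    """Draw n transfer lengths in packets, rounded to whole packets."""
    rng = random.Random(seed)
    scale = pareto_scale(mean, shape)
    # random.paretovariate(shape) returns samples >= 1; multiply by the scale.
    return [max(1, round(scale * rng.paretovariate(shape))) for _ in range(n)]

print(f"scale = {pareto_scale(40.0, 1.06):.3f} packets")
print(sorted(sample_transfer_lengths(60))[-5:])   # the few largest transfers
```

With a shape this close to 1 the distribution is very heavy-tailed: most transfers are only a few packets long, a handful are very large, and the sample mean of 60 draws can be far from 40.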


Algorithm    mean time    std. dev
DRR          46.38        197.35
CSFQ         88.21        230.29
FRED         73.48        272.25
RED          790.28       1651.38
FIFO         1736.93      1826.74

Table 4.2: The mean transfer times (in ms) and the corresponding standard deviations for 60 short TCPs in
the presence of a CBR flow that sends at the link capacity, i.e., 10 Mbps.

Table  4.2  presents  the  mean  transfer  time  and  the  corresponding  standard  deviations.  Here, 
CSFQ  and  FRED  do  less  well  than  DRR,  but  one  order  of  magnitude  better  than  FIFO  and  RED. 

4.4.5  Large  Latency 

All  of  our  experiments  so  far  have  had  minimal  latencies.  In  this  experiment  we  again  consider  a 
single  10  Mbps  congested  link,  but  now  the  flows  have  propagation  delays  of  100  ms  in  getting  to 
the  congested  link.  The  load  is  comprised  of  one  CBR  that  sends  at  the  link  capacity  and  19  TCP 
flows.  Table  4.3  shows  the  mean  number  of  packets  forwarded  for  each  TCP  flow  during  a  100  sec 
time  interval.  CSFQ  and  FRED  both  perform  reasonably  well. 

Algorithm    mean       std. dev
DRR          5857.89    192.86
CSFQ         5135.05    175.76
FRED         4967.05    261.23
RED          628.10     80.46
FIFO         379.42     68.72

Table 4.3: The mean throughput (in packets) and standard deviation for 19 TCPs in the presence of a CBR
flow along a link with propagation delay of 100 ms. The CBR sends at the link capacity of 10 Mbps.

4.4.6 Packet Relabeling

Recall that when the dropping probability P of a packet is non-zero we relabel it with a new label
$L_{new} = (1 - P) L_{old}$ so that the label of the packet will reflect the new rate of the flow. To
test how well this works in practice, we consider the topology in Figure 4.10, where each link is 10
Mbps. Note that as long as all three flows attempt to use their full fair share, the fair shares of flows 1
and 2 are less on link 2 (3.33 Mbps) than on link 1 (5 Mbps), so there will be dropping on both links.
This will test the relabeling function to make sure that the incoming rates are accurately reflected on
the second link. We perform two experiments (only looking at CSFQ's performance). In the first,
there are three CBRs sending data at 10 Mbps each. Table 4.4 shows the average throughputs over
10 sec of the three CBR flows. As expected, these rates are close to 3.33 Mbps. In the second
experiment, we replace the three CBRs by three TCPs. Again, despite the TCP burstiness which
may negatively affect the rate estimation and relabeling accuracy, each TCP gets its fair share.

Figure 4.10: Simulation scenario for the packet relabeling experiment. Each link has 10 Mbps capacity, and
a propagation delay of 1 ms. (Diagram labels: Flow 1, Flow 2, and Flow 3 sources, 10 Mbps each; Link 1
and Link 2, 10 Mbps each; Sink.)

Traffic    Flow 1    Flow 2    Flow 3
CBR        3.267     3.262     3.458
TCP        3.232     3.336     3.358

Table 4.4: The throughputs (in Mbps) resulting from CSFQ averaged over 10 sec for the three flows in
Figure 4.10 along link 2.
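
The effect of relabeling in this scenario can be traced with a short fluid-model sketch. It assumes the CSFQ dropping rule from the algorithm description earlier in this chapter, P = max(0, 1 - α/L), where L is the packet label and α the link's fair share rate, together with the relabeling rule L_new = (1 - P) L_old. The function names are illustrative, and the fair rates of 5 Mbps and 3.33 Mbps are the ones stated for links 1 and 2 in this section.

```python
import random

def drop_probability(label, fair_rate):
    """Assumed CSFQ dropping rule: P = max(0, 1 - fair_rate / label)."""
    return max(0.0, 1.0 - fair_rate / label)

def forward(label, fair_rate, rng=random):
    """Per-packet decision: return the new label, or None if the packet is dropped."""
    p = drop_probability(label, fair_rate)
    if rng.random() < p:
        return None
    return (1.0 - p) * label

# Fluid view of flow 1: it enters link 1 labeled with its 10 Mbps arrival rate.
label_in = 10.0
label_after_link1 = (1.0 - drop_probability(label_in, 5.0)) * label_in
label_after_link2 = (1.0 - drop_probability(label_after_link1, 10.0 / 3.0)) * label_after_link1

print(f"label after link 1: {label_after_link1:.2f} Mbps")
print(f"label after link 2: {label_after_link2:.2f} Mbps")
```

Without relabeling, packets would reach link 2 still carrying the 10 Mbps label, and link 2 would drop them as if the flow were arriving at twice its actual rate.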

4.4.7  Discussion  of  Simulation  Results 

We  have  tested  CSFQ  under  a  wide  range  of  conditions;  conditions  purposely  designed  to  stress 
its  ability  to  achieve  fair  allocations.  These  tests,  and  the  others  we  have  run  but  cannot  show 
here  because  of  space  limitations,  suggest  that  CSFQ  achieves  a  reasonable  approximation  of  fair 
bandwidth  allocations  in  most  conditions.  Certainly  CSFQ  is  far  superior  in  this  regard  to  the  status 
quo  (FIFO  or  RED).  Moreover,  in  all  situations  CSFQ  is  roughly  comparable  with  FRED,  and  in 
some  cases  it  achieves  significantly  fairer  allocations.  Recall  that  FRED  requires  per-packet  flow 
classification  while  CSFQ  does  not,  so  we  are  achieving  these  levels  of  fairness  in  a  more  scalable 
manner.  However,  there  is  clearly  room  for  improvement  in  CSFQ.  We  think  our  buffer  management 
algorithm  may  not  be  well-tuned  to  the  vagaries  of  TCP  buffer  usage,  and  so  are  currently  looking 
at  adopting  an  approach  closer  in  spirit  to  RED  for  buffer  management,  while  retaining  the  use  of 
the  labels  to  achieve  fairness. 


4.5  Related  Work 

An alternative to the allocation approach was recently proposed to address the problem of
ill-behaved (unfriendly) flows. This approach is called the identification approach and it is best
exemplified by Floyd and Fall [36]. In this approach, routers use a lightweight detection algorithm
to identify unfriendly flows, and then explicitly manage the bandwidth of these unfriendly flows.
This bandwidth management can range from merely restricting unfriendly flows to no more than
the currently highest friendly flow's share to the extreme of severely punishing unfriendly flows by
dropping all of their packets.

             Simulation 1                 Simulation 2
Algorithm    UDP      TCP-1    TCP-2      TCP-1    TCP-2
REDI         0.906    0.280    0.278      0.565    0.891
CSFQ         0.554    0.468    0.478      0.729    0.747

Table 4.5: Simulation 1 - The throughputs in Mbps of one UDP and two TCP flows along a 1.5 Mbps link
under REDI [36] and CSFQ, respectively. Simulation 2 - The throughputs of two TCPs (where TCP-2 opens
its congestion window three times faster than TCP-1), under REDI and CSFQ, respectively.

Compared to CSFQ, the solution to implementing the identification approach described by Floyd
and Fall [36] has several drawbacks. First, this solution requires routers to maintain state for each
flow that has been classified as unfriendly. In contrast, CSFQ does not require core routers to
maintain any per flow state. Second, designing accurate identification tests for unfriendly flows is
inherently difficult. Finally, the identification approach requires all flows to implement a similar
congestion control mechanism, i.e., to be TCP friendly. We believe that this is overly restrictive as it
severely limits the freedom of designing new congestion protocols that best suit application needs.

Next, we discuss in more detail the difficulty of providing accurate identification tests for
unfriendly flows. One can think of the process of identifying unfriendly flows as occurring in two
logically distinct stages. The first and relatively easy step is to estimate the arrival rate of a flow.
The second, and harder, step is to use this arrival rate information (along with the dropping rate and
other aggregate measurements) to decide if the flow is unfriendly. Assuming that friendly flows use
a TCP-like adjustment method of increase-by-one and decrease-by-half, one can derive an expression
(see [36] for details) for the bandwidth share S as a function of the dropping rate p, round-trip
time R, and packet size B: $S \approx \frac{\gamma B}{R \sqrt{p}}$ for some constant $\gamma$. Because routers do not know the round
trip time R of flows, they use the lower bound of double the propagation delay of the attached link.
Unfortunately, this allows flows further away from the link to behave more aggressively without
being identified as being unfriendly.
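
The consequence of underestimating R can be made concrete with a small numerical sketch of the expression S ≈ γB / (R√p). The default value of γ below, 1.5·sqrt(2/3), is an assumption based on our reading of the derivation in [36], and the packet size, drop rate, and round-trip times are hypothetical values chosen only for illustration.

```python
import math

# Assumed constant; see [36] for the derivation. Any positive value gives the same ratio.
GAMMA = 1.5 * math.sqrt(2.0 / 3.0)

def friendly_share_bps(pkt_bytes, rtt_s, drop_rate, gamma=GAMMA):
    """Bandwidth share S = gamma * B / (R * sqrt(p)), in bits per second."""
    return gamma * pkt_bytes * 8 / (rtt_s * math.sqrt(drop_rate))

pkt_bytes, drop_rate = 1000, 0.01
router_rtt = 2 * 0.003      # lower bound: twice the attached link's delay (hypothetical 3 ms)
actual_rtt = 0.060          # hypothetical true round-trip time of a distant flow

allowed = friendly_share_bps(pkt_bytes, router_rtt, drop_rate)
friendly = friendly_share_bps(pkt_bytes, actual_rtt, drop_rate)
print(f"share tolerated by the router: {allowed / 1e6:.2f} Mbps")
print(f"share of a truly friendly flow: {friendly / 1e6:.2f} Mbps")
print(f"slack factor: {allowed / friendly:.1f}")
```

Since S is inversely proportional to R, the slack available to a distant flow is simply the ratio between its true round-trip time and the router's lower bound, regardless of the constant γ.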

To see how this occurs in practice, consider the following two experiments using the identification
algorithm described by Floyd and Fall [36], which we call RED with Identification (REDI).*
In each case there are multiple flows traversing a 1.5 Mbps link with a latency of 3 ms; the output
buffer size is 32 KB and all constants K, $K_\alpha$, and $K_c$ are set to 400 ms. Table 4.5
shows the bandwidth allocations under REDI and CSFQ averaged over 100 sec. In the first experiment
(Simulation 1), we consider a 1 Mbps UDP flow and two TCP flows; in the second experiment
(Simulation 2) we have a standard TCP (TCP-1) and a modified TCP (TCP-2) that opens the
congestion window three times faster. In both cases REDI fails to identify the unfriendly flow, allowing
it to obtain almost two-thirds of the bandwidth. As we increase the latency of the congested link,
REDI starts to identify unfriendly flows. However, for some latencies as high as 18 ms, it still fails to
identify such flows. Thus, the identification approach still awaits a viable realization and, as of now,
the allocation approach is the only demonstrated method able to deal with the problem of unfriendly
flows.

* We are grateful to Sally Floyd, who provided us her script implementing the REDI algorithm. We used a similar script
in our simulation, with the understanding that this is a preliminary design of the identification algorithm. Our contention
is that the design of such an identification algorithm is fundamentally difficult due to the uncertainty of RTT.

The problem of estimating the fair share rate has also been studied in the context of designing
Available Bit Rate (ABR) algorithms for ATM networks. While the problems are similar, there are
also several important differences. First, in ATM ABR, the goal is to provide explicit feedback to
end systems for policing purposes so that cell loss inside the network can be prevented. In CSFQ,
there is no explicit feedback or edge policing. Packets from a flow may arrive at a much higher
rate than the flow's fair share rate and the goal of CSFQ is to ensure, by probabilistic dropping, that
such flows do not get more service than their shares. Second, since ATM already keeps per virtual
circuit (VC) state, additional per VC state is usually added to improve the accuracy and reduce the
time complexity of estimating the fair share rate [20, 33, 61]. However, there are several algorithms
that try to estimate the fair share without keeping per flow state [87, 95]. These algorithms rely on
the flow rates communicated by the end-system. These estimates are assumed to remain accurate
over multiple hops, due to the accurate explicit congestion control provided by ABR. In contrast, in
CSFQ, since the number of dropped packets cannot be neglected, the flow rates are re-computed at
each router, if needed (see Section 4.3.2.3). In addition, the estimation algorithms are themselves
quite different. While these ABR algorithms average over the flow rates communicated by the
end-systems, CSFQ uses linear interpolation in conjunction with exponentially averaging the traffic
aggregates at the router. Our preliminary analysis and evaluation show that our estimation algorithm
is better suited to our context.


4.6  Summary 

In this chapter we have presented a solution that achieves fair bandwidth allocation, without
requiring core routers to maintain any per-flow state. The key idea is to use the DPS technique to
approximate the service provided by a reference network - in which every node implements Fair
Queueing - within a SCORE network. Each node in the SCORE network implements a novel
algorithm, called Core-Stateless Fair Queueing (CSFQ). With CSFQ, edge routers estimate flow rates
and insert them into the packet headers. Core routers simply perform probabilistic dropping on
input based on these labels and an estimate of the fair share rate, the computation of which requires
only aggregate measurements. Packet labels are rewritten by the core routers to reflect output rates,
so this approach can handle multi-hop situations.

We  have  tested  CSFQ  and  several  other  algorithms  under  a  wide  variety  of  conditions.  We 
have  found  that  CSFQ  achieves  a  significant  degree  of  fairness  in  all  of  these  circumstances.  While 
not  matching  the  fairness  benchmark  of  DRR,  it  is  comparable  or  superior  to  FRED,  and  vastly 
better  than  the  baseline  cases  of  RED  and  FIFO.  We  know  of  no  other  approach  that  can  achieve 
comparable  levels  of  fairness  without  any  per-flow  operations  in  the  core  routers. 
Chapter 5


Providing  Guaranteed  Services  in  SCORE 


In  the  previous  chapter  we  demonstrated  that  using  the  SCORE/DPS  framework  makes  it  possible  to 
implement  a  service  with  a  per  flow  semantic  (i.e.,  per  flow  isolation)  in  a  stateless  core  architecture. 
In  this  chapter,  we  present  a  second  example  which  shows  that  it  is  possible  to  provide  per  flow 
bandwidth  and  delay  guarantees  in  a  core  stateless  network.  We  achieve  this  goal  by  using  the  DPS 
technique  to  implement  the  functionalities  required  by  the  Guaranteed  service  [93]  on  both  the  data 
and  control  paths  in  a  SCORE  network. 

The  rest  of  this  chapter  is  organized  as  follows.  The  next  section  presents  the  motivation  behind 
the  Guaranteed  service  and  describes  the  problems  with  the  existing  solutions.  Section  5.2  outlines 
our  solution  to  implement  the  Guaranteed  service  in  a  SCORE  network.  Sections  5.3  and  5.4  present 
details  of  both  our  data  and  control  path  algorithms.  Section  5.5  describes  a  design  and  a  prototype 
implementation  of  the  proposed  algorithms  in  IPv4  networks.  Finally,  Section  5.6  describes  related 
work,  and  Section  5.7  summarizes  our  findings. 


5.1  Background 

As  new  applications  such  as  IP  telephony,  video-conferencing,  audio  and  video  streaming,  and 
distributed  games  are  deployed  in  the  Internet,  there  is  a  growing  need  to  support  more  sophisticated 
services  than  the  best-effort  service.  Unlike  traditional  applications  such  as  file  transfer,  these  new 
applications  have  much  stricter  timeliness  and  bandwidth  requirements.  For  example,  in  order  to 
provide  a  quality  comparable  to  today’s  telephone  service,  the  end-to-end  delay  should  not  exceed 
100  ms  [64].  Since  in  a  global  network  the  propagation  delay  alone  is  about  100  ms,  meeting  such 
tight delay requirements is a challenging task [7]. Similarly, to provide high quality video and audio
broadcasting,  it  is  desirable  to  be  able  to  ensure  both  bandwidth  and  delay  guarantees. 

To  support  these  new  applications,  the  IETF  has  proposed  two  service  models:  the  Guaranteed 
service  [93]  defined  in  the  context  of  Intserv  [82],  and  the  Premium  service  [76]  defined  in  the 
context of Diffserv [32]. These services have important differences in both their semantic and
implementation complexity. At the service definition level, while the Guaranteed service can provide
both per flow delay and bandwidth guarantees [93], the Premium service can provide only per flow
bandwidth and per aggregate delay guarantees [76]. Thus, with the Premium service, if two flows
have different delay requirements, say $d_1$ and $d_2$, the only way to meet both these delay
requirements is to ensure a delay of $d = \min(d_1, d_2)$ to both flows. The main drawback of this
approach is that it can result in very low resource utilization for the premium traffic. In particular,
as we show in Appendix B.1, even if the fraction that can be allocated to the premium traffic on every
link in the network is very low (e.g., 10%), the queueing delay across a large network (e.g., 15 routers)
can be relatively large (e.g., 240 ms). In contrast, the Guaranteed service can achieve both higher
resource utilization and tighter delay bounds, by better matching flow requirements to resource usage.

At the implementation level, current solutions to provide guaranteed services require each router
to  process  per  flow  signaling  messages  and  maintain  per  flow  state  on  the  control  path,  and  to 
perform  per  flow  classification,  scheduling,  and  buffer  management  on  the  data  path.  Performing 
per  flow  management  inside  the  network  affects  both  the  network  scalability  and  robustness.  The 
former  is  because  the  per  flow  operation  complexity  usually  increases  as  a  function  of  the  number 
of  flows;  the  latter  is  because  it  is  difficult  to  maintain  the  consistency  of  dynamic,  and  replicated 
per  flow  state  in  a  distributed  network  environment.  While  there  are  several  proposals  that  aim  to 
reduce  the  number  of  flows  inside  the  network,  by  aggregating  micro-flows  that  follow  the  same 
path  into  one  macro-flow  [5,  49],  they  only  alleviate  this  problem,  but  do  not  fundamentally  solve 
it.  This  is  because  the  number  of  macro  flows  can  still  be  quite  large  in  a  network  with  many  edge 
routers,  as  the  number  of  paths  is  a  quadratic  function  of  the  number  of  edge  nodes. 

In  contrast,  the  Premium  service  is  implemented  in  the  context  of  the  Diffserv  architecture 
which  distinguishes  between  edge  and  core  routers.  While  edge  routers  process  packets  on  a  per 
flow  or  per  aggregate  basis,  core  routers  do  not  maintain  any  fine  grained  state  about  the  traffic; 
they  simply  forward  premium  packets  based  on  a  high  priority  bit  set  in  the  packet  headers  by 
edge  routers.  Pushing  the  complexity  to  the  edge  and  maintaining  a  simple  core  makes  the  data 
plane  highly  scalable.  However,  the  Premium  service  still  requires  admission  control  on  the  control 
path.  One  proposal  is  to  use  a  centralized  bandwidth  broker  that  maintains  the  topology  as  well 
as  the  state  of  all  nodes  in  the  network.  In  this  case,  the  admission  control  can  be  implemented 
by  the  broker,  eliminating  the  need  for  maintaining  distributed  reservation  state.  However,  such  a 
centralized  approach  is  more  appropriate  for  an  environment  where  most  flows  are  long  lived,  and 
set-up  and  tear-down  events  are  rare. 

In  summary,  the  Guaranteed  service  is  more  powerful  but  has  serious  limitations  with  respect 
to  network  scalability  and  robustness.  On  the  other  hand,  the  Premium  service  is  more  scalable, 
but  cannot  achieve  the  levels  of  flexibility  and  utilization  of  the  Guaranteed  service.  In  addition, 
scalable  and  robust  admission  control  for  the  Premium  service  is  still  an  open  research  problem. 

In  this  chapter  we  show  that  by  using  the  SCORE/DPS  framework  we  can  achieve  the  best 
of  the  two  worlds:  provide  Guaranteed  service  semantic  while  maintaining  the  scalability  and  the 
robustness  of  the  Diffserv  architecture. 


5.2  Solution  Outline 

Current  solutions  to  implement  the  Guaranteed  service  assume  a  stateful  network  in  which  each 
router  maintains  per  flow  state.  The  state  is  used  by  both  the  admission  control  module  in  the 
control  plane  and  the  classifier  and  scheduler  in  the  data  plane. 

In this chapter, we propose scheduling and admission control algorithms that provide the
Guaranteed service but do not require core routers to maintain per flow state. The main idea behind our
solution is to approximate a reference stateful network that provides the Guaranteed service in a
SCORE network (see Figure 5.1). The key technique used to implement these algorithms is
Dynamic Packet State (DPS). On the control path, our solution aims to emulate per flow admission
control, while on the data path, our algorithm aims to emulate a reference network in which every
node implements the Delay-Jitter-Controlled Virtual Clock (Jitter-VC) algorithm.

Figure 5.1: (a) A reference stateful network that provides the Guaranteed service [93]. Each node
implements the Jitter-Virtual Clock (Jitter-VC) algorithm on the data path, and per flow admission
control on the control path. (b) A SCORE network that emulates the service provided by the reference
network. On the data path, each node approximates Jitter-VC with a new algorithm, called Core-Jitter
Virtual Clock (CJVC). On the control path each node approximates per flow admission control.

Among  many  scheduling  disciplines  that  can  implement  the  Guaranteed  service  we  chose  Jitter- 
VC  for  several  reasons.  First,  unlike  various  Fair  Queueing  algorithms  [31,  79],  in  which  a  packet’s 
deadline  can  depend  on  state  variables  of  all  active  flows,  in  Jitter-VC  a  packet’s  deadline  depends 
only  on  the  state  variables  of  the  flow  it  belongs  to.  This  property  of  Jitter-VC  makes  the  algorithm 
easier to approximate in a SCORE network. In particular, the fact that a packet’s deadline can be
computed  exclusively  based  on  the  state  variables  of  the  flow  it  belongs  to,  makes  it  possible  to 
eliminate  the  need  to  replicate  and  maintain  per  flow  state  at  all  nodes  along  the  path.  Instead,  per 
flow  state  can  be  stored  only  at  the  ingress  node,  inserted  into  the  packet  header  by  the  ingress  node, 
and  retrieved  later  by  core  nodes,  which  then  use  it  to  determine  the  packet’s  deadline.  Second,  by 
regulating  traffic  inside  the  network  using  delay-jitter-controllers  (discussed  below),  it  can  be  shown 
that  with  very  high  probability,  the  number  of  packets  in  the  server  at  any  given  time  is  significantly 
smaller  than  the  number  of  flows  (see  Section  5.3.3).  This  helps  to  simplify  the  scheduler. 

In  the  next  section,  we  present  techniques  that  eliminate  the  need  for  data  plane  algorithms  to 
use  per  flow  state  at  core  nodes.  In  particular,  at  core  nodes,  packet  classification  is  no  longer  needed 
and  packet  scheduling  is  based  on  the  state  carried  in  packet  headers,  rather  than  per  flow  state  stored 
locally  at  each  node.  In  Section  5.4,  we  will  show  that  fully  distributed  admission  control  can  also 
be  achieved  without  the  need  for  maintaining  per  flow  state  at  core  nodes. 


5.3  Data  Plane:  Scheduling  Without  Per  Flow  State 

In  this  section,  we  first  describe  Jitter-VC,  which  is  used  to  achieve  guaranteed  services  in  the 
reference network, and then present our algorithm, called Core-Jitter-VC (CJVC). CJVC uses the
Dynamic Packet State (DPS) technique to emulate Jitter-VC in a SCORE network. In Appendix B.3
we present an analysis to show that a network of routers implementing CJVC provides the same
delay bound as a network of routers implementing the Jitter-VC algorithm.


5.3.1  Jitter  Virtual  Clock  (Jitter-VC) 

Jitter-VC is a non-work-conserving version of the Virtual Clock algorithm [127]. It uses a
combination of a delay-jitter rate-controller [112, 126] and a Virtual Clock scheduler. The algorithm
works as follows: each packet is assigned an eligible time and a deadline upon its arrival. The packet
is held in the rate-controller until it becomes eligible, i.e., the system time exceeds the packet’s
eligible time (see Figure 5.2(a)). The scheduler then orders the transmission of eligible packets
according to their deadlines.


Notation        Comments

$p_i^k$         the $k$-th packet of flow $i$
$l_i^k$         length of $p_i^k$
$a_{i,j}^k$     arrival time of $p_i^k$ at node $j$
$s_{i,j}^k$     time when $p_i^k$ was sent by node $j$
$e_{i,j}^k$     eligible time of $p_i^k$ at node $j$
$d_{i,j}^k$     deadline of $p_i^k$ at node $j$
$g_{i,j}^k$     time ahead of schedule: $g_{i,j}^k = d_{i,j}^k + \tau_j - s_{i,j}^k$
$\delta_i^k$    slack delay of $p_i^k$
$\pi_j$         propagation delay between nodes $j$ and $j + 1$
$\tau_j$        transmission time of a maximum size packet at node $j$

Table 5.1: Notations used in Section 5.3.

For the $k$-th packet of flow $i$, its eligible time, $e_{i,j}^k$, and deadline, $d_{i,j}^k$, at the
$j$-th node on its path are computed as follows:

$$e_{i,j}^1 = a_{i,j}^1, \qquad e_{i,j}^k = \max\left(a_{i,j}^k + g_{i,j-1}^k,\; d_{i,j}^{k-1}\right), \quad i, j \ge 1,\; k > 1 \tag{5.1}$$

$$d_{i,j}^k = e_{i,j}^k + \frac{l_i^k}{r_i}, \quad i, j, k \ge 1 \tag{5.2}$$

where $l_i^k$ is the length of the packet, $r_i$ is the reserved rate for the flow, $a_{i,j}^k$ is the
packet’s arrival time at the $j$-th node traversed by the packet, and $g_{i,j-1}^k$, stamped into the
packet header by the previous node, is the amount of time the packet was transmitted before its
schedule, i.e., the difference between the packet’s deadline and its actual departure time at node
$j - 1$. To account for the fact that the packet transmission is not preemptive, and as a result a
packet can miss its deadline by the time it takes to transmit a packet of maximum size, $\tau_j$ [127],
we inflate the packet delay by $\tau_j$ when computing $g_{i,j}^k$ (see Table 5.1).

Intuitively, the algorithm eliminates the delay variation of different packets by forcing all packets
to incur the maximum allowable delay. The purpose of having $g_{i,j-1}^k$ is to compensate at node $j$
the variation of delay due to load fluctuation at the previous node, $j - 1$. Such regulations limit
the traffic burstiness caused by network load fluctuations, and as a consequence, reduce both buffer
space requirements and the scheduler complexity.

It has been shown that if a flow’s long term arrival rate is no greater than its reserved rate, a
network of Virtual Clock servers can provide the same delay guarantee to the flow as a network
of Weighted Fair Queueing (WFQ) servers [35, 47, 100]. In addition, it has been shown that a
network of Jitter-VC servers can provide the same delay guarantees as a network of Virtual Clock
servers [28, 42]. Therefore, a network of Jitter-VC servers can provide the same guaranteed service
as a network of WFQ servers.
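
The per-packet computation of Eqs. (5.1) and (5.2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated in this chapter: the names `JitterVCFlow`, `on_arrival`, and `time_ahead` are hypothetical, times are in seconds, lengths in bits, rates in bits per second, and the first packet of a flow is assumed to become eligible on arrival.

```python
class JitterVCFlow:
    """Per-flow state that a Jitter-VC node keeps (hypothetical names)."""

    def __init__(self, rate):
        self.rate = rate            # reserved rate r_i, in bits per second
        self.last_deadline = None   # d_{i,j}^{k-1}; None before the first packet

    def on_arrival(self, arrival, g_prev, length):
        """Return (eligible time, deadline) for one packet, per Eqs. (5.1)-(5.2).

        arrival: arrival time a_{i,j}^k at this node
        g_prev:  time ahead of schedule stamped by the previous node
        length:  packet length l_i^k in bits
        """
        if self.last_deadline is None:
            eligible = arrival      # assumption: first packet is eligible on arrival
        else:
            eligible = max(arrival + g_prev, self.last_deadline)
        deadline = eligible + length / self.rate
        self.last_deadline = deadline
        return eligible, deadline


def time_ahead(deadline, tau, sent):
    """g = d + tau - s: how far ahead of schedule the packet left the node."""
    return deadline + tau - sent


if __name__ == "__main__":
    flow = JitterVCFlow(rate=1000.0)
    print(flow.on_arrival(arrival=0.0, g_prev=0.0, length=1000))
    print(flow.on_arrival(arrival=0.25, g_prev=0.25, length=500))
    print(flow.on_arrival(arrival=2.0, g_prev=0.25, length=1000))
```

Note that `last_deadline` is exactly the per flow variable that the next section sets out to remove from core nodes.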

5.3.2 Core-Jitter-VC (CJVC)

In  this  section  we  propose  a  variant  of  Jitter-VC,  called  Core-Jitter-VC  (CJVC),  which  does  not 
require  per  flow  state  at  core  nodes.  In  addition,  we  show  that  a  network  of  CJVC  servers  can 
provide  the  same  guaranteed  service  as  a  network  of  Jitter-VC  servers. 

CJVC uses the DPS technique. The key idea is to have the ingress node encode scheduling
parameters in each packet’s header. The core routers can then make scheduling decisions based
on the parameters encoded in packet headers, thus eliminating the need for maintaining per flow
state at core nodes. As suggested by Eqs. (5.1) and (5.2), the Jitter-VC algorithm needs two state
variables for each flow $i$: $r_i$, which is the reserved rate for flow $i$, and $d_{i,j}^k$, which is the
deadline of the last packet from flow $i$ that was served by node $j$. While it is straightforward to
eliminate $r_i$ by putting it in the packet header, it is not trivial to eliminate $d_{i,j}^k$. The
difference between $r_i$ and $d_{i,j}^k$ is that while all nodes along the path keep the same $r_i$ value
for flow $i$, $d_{i,j}^k$ is a dynamic value that is computed iteratively at each node. In fact, the
eligible time and the deadline of $p_i^k$ depend on the deadline of the previous packet of the same
flow, i.e., $d_{i,j}^{k-1}$.

A naive implementation using the DPS technique would be to precompute the eligible times
and the deadlines of the packet at all nodes along its path and insert all of them in the header. This
would eliminate the need for core nodes to maintain $d_{i,j}^k$. The main disadvantage of this approach
is that the amount of information carried by the packet increases with the number of hops along the
path. The challenge then is to design algorithms that compute $d_{i,j}^k$ for all nodes while requiring a
minimum amount of state in the packet header.

Notice that in Eq. (5.1), the reason for node $j$ to maintain $d_{i,j}^k$ is that it will be used to
compute the deadline and the eligible time of the next packet. Since it is only used in a max
operation, we can eliminate the need for $d_{i,j}^k$ if we can ensure that the other term in max is never
less than $d_{i,j}^k$. The key idea is then to use a slack variable associated with each packet, denoted
$\delta_i^k$, such that for every core node $j$ along the path, the following holds

$$a_{i,j}^k + g_{i,j-1}^k + \delta_i^k \ge d_{i,j}^{k-1}, \quad j > 1 \tag{5.3}$$

By replacing the first term of max in Eq. (5.1) with $a_{i,j}^k + g_{i,j-1}^k + \delta_i^k$, the computation
of the eligible time reduces to

$$e_{i,j}^k = a_{i,j}^k + g_{i,j-1}^k + \delta_i^k, \quad j > 1 \tag{5.4}$$

Therefore, by using one additional DPS variable, $\delta_i^k$, we eliminate the need for maintaining
$d_{i,j}^k$ at the core nodes.

Figure 5.2: The time diagram of the first two packets of flow $i$ along a four node path under
(a) Jitter-VC, and (b) CJVC, respectively. Propagation times, $\pi_j$, and transmission times of
maximum size packets, $\tau_j$, are ignored.


The derivation of $\delta_i^k$ proceeds in two steps. First, we express the eligible time of packet
$p_i^k$ at an arbitrary core node $j$, $e_{i,j}^k$, as a function of the eligible time of $p_i^k$ at the
ingress node, $e_{i,1}^k$ (see Eq. (5.7)). Second, we use this result and Eq. (5.4) to derive a lower
bound for $\delta_i^k$.

We now proceed with the first step. Recall that $g_{i,j-1}^k$ represents the time by which $p_i^k$ is
transmitted before its schedule at node $j - 1$, i.e., $g_{i,j-1}^k = d_{i,j-1}^k + \tau_{j-1} - s_{i,j-1}^k$,
where $\tau_{j-1}$ is the maximum time by which a packet can miss its deadline at node $j - 1$. Let
$\pi_{j-1}$ denote the propagation delay between nodes $j - 1$ and $j$. Then the arrival time of $p_i^k$
at node $j$, $a_{i,j}^k$, is given by

$$a_{i,j}^k = s_{i,j-1}^k + \pi_{j-1} = d_{i,j-1}^k - g_{i,j-1}^k + \pi_{j-1} + \tau_{j-1}. \tag{5.5}$$


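By replacing $a_{i,j}^k$, given by the above expression, in Eq. (5.4), and then using Eq. (5.2), we obtain

$$e_{i,j}^k = d_{i,j-1}^k + \delta_i^k + \tau_{j-1} + \pi_{j-1} = e_{i,j-1}^k + \frac{l_i^k}{r_i} + \delta_i^k + \tau_{j-1} + \pi_{j-1}. \tag{5.6}$$

By iterating over the above equation we express $e_{i,j}^k$ as a function of $e_{i,1}^k$:

$$e_{i,j}^k = e_{i,1}^k + (j - 1)\left(\frac{l_i^k}{r_i} + \delta_i^k\right) + \sum_{m=1}^{j-1} (\tau_m + \pi_m), \quad j > 1 \tag{5.7}$$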
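
The iteration can be written out explicitly. As a sketch, write $\Delta_i^k = l_i^k / r_i + \delta_i^k$ (a shorthand introduced here only for this illustration) for the per-hop term that does not depend on the node, and apply the single-hop recursion repeatedly from node $j$ down to node 2:

```latex
\begin{align*}
e_{i,j}^k &= e_{i,j-1}^k + \Delta_i^k + \tau_{j-1} + \pi_{j-1},
  \qquad \Delta_i^k = \frac{l_i^k}{r_i} + \delta_i^k \\
&= e_{i,j-2}^k + 2\Delta_i^k + (\tau_{j-2} + \pi_{j-2}) + (\tau_{j-1} + \pi_{j-1}) \\
&\;\;\vdots \\
&= e_{i,1}^k + (j-1)\Delta_i^k + \sum_{m=1}^{j-1} (\tau_m + \pi_m).
\end{align*}
```

Each step adds one copy of $\Delta_i^k$ and one pair $(\tau_m + \pi_m)$, so after $j - 1$ steps the recursion bottoms out at the ingress eligible time $e_{i,1}^k$.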


We are now ready to compute $\delta_i^k$. Recall that the goal is to compute the minimum $\delta_i^k$
which ensures that Eq. (5.3) holds for every node along the path. After combining Eq. (5.3), Eq. (5.4)
and Eq. (5.2) this reduces to ensuring that

$$e_{i,j}^k \ge d_{i,j}^{k-1} \;\Rightarrow\; e_{i,j}^k \ge e_{i,j}^{k-1} + \frac{l_i^{k-1}}{r_i}, \quad j > 1 \tag{5.8}$$


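By plugging $e_{i,j}^k$ and $e_{i,j}^{k-1}$ as expressed by Eq. (5.7) into Eq. (5.8), we get

$$\delta_i^k \ge \delta_i^{k-1} + \frac{l_i^{k-1} - l_i^k}{r_i} - \frac{e_{i,1}^k - e_{i,1}^{k-1} - l_i^{k-1}/r_i}{j - 1}, \quad j > 1 \tag{5.9}$$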
ingress node
  on packet p arrival
    i = get_flow(p);
    if (first_packet_of_flow(p, i))
      e_i = current_time;
      δ_i = 0;
    else
      δ_i = max(0, δ_i + (l_i − length(p))/r_i −
                   max(current_time − d_i, 0)/(h − 1));   /* Eq. (5.10) */
      e_i = max(current_time, d_i);
    l_i = length(p);
    d_i = e_i + l_i/r_i;
  on packet p transmission
    label(p) ← (r_i, d_i − current_time, δ_i);

core/egress node
  on packet p arrival
    (r, g, δ) ← label(p);
    e = current_time + g + δ;   /* Eq. (5.4) */
    d = e + length(p)/r;
  on packet p transmission
    if (core node)
      label(p) ← (r, d − current_time, δ);
    else   /* this is an egress node */
      clear_label(p);

Figure 5.3: Algorithms performed by ingress, core, and egress nodes at the packet arrival and departure.
Note that core and egress nodes do not maintain per flow state.


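From Eqs. (5.1) and (5.2) we have $e_{i,1}^k \ge d_{i,1}^{k-1} = e_{i,1}^{k-1} + l_i^{k-1}/r_i$. Thus, the
right-hand side term in Eq. (5.9) is maximized when $j = h$. As a result we compute $\delta_i^k$ as

$$\delta_i^1 = 0, \qquad \delta_i^k = \max\left(0,\; \delta_i^{k-1} + \frac{l_i^{k-1} - l_i^k}{r_i} - \frac{e_{i,1}^k - e_{i,1}^{k-1} - l_i^{k-1}/r_i}{h - 1}\right), \quad k > 1,\; h > 1 \tag{5.10}$$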


In this way, CJVC ensures that the eligible time of every packet, $p_i^k$, at node $j$ is no smaller
than the deadline of the previous packet of the same flow at node $j$, i.e., $e_{i,j}^k \ge d_{i,j}^{k-1}$.
In addition, the Virtual Clock scheduler ensures that the deadline of every packet is not missed by
more than $\tau_j$ [127].
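
As a worked illustration of Eq. (5.10), with numbers chosen here purely for the example, take $r_i = 1$ Mbps, a path of $h = 5$ nodes, $\delta_i^{k-1} = 0$, $l_i^{k-1} = 12000$ bits, and $l_i^k = 4000$ bits, and suppose packet $k$ becomes eligible at the ingress 6 ms after the deadline of packet $k - 1$. The margin at node $j$ follows from subtracting the two instances of Eq. (5.7):

```latex
\begin{align*}
\delta_i^k &= \max\left(0,\; 0 + \frac{12000 - 4000}{10^6}\,\text{s} - \frac{6\,\text{ms}}{5 - 1}\right)
            = \max(0,\; 8\,\text{ms} - 1.5\,\text{ms}) = 6.5\,\text{ms}, \\
e_{i,j}^k - d_{i,j}^{k-1}
  &= \left(e_{i,1}^k - d_{i,1}^{k-1}\right)
     + (j - 1)\left(\delta_i^k - \delta_i^{k-1} + \frac{l_i^k - l_i^{k-1}}{r_i}\right) \\
  &= 6\,\text{ms} + (j - 1)(6.5\,\text{ms} - 8\,\text{ms}) = 6\,\text{ms} - 1.5\,(j - 1)\,\text{ms}.
\end{align*}
```

The margin is 4.5, 3, 1.5, and 0 ms at nodes $j = 2, \ldots, 5$: it stays non-negative at every hop and reaches zero exactly at the last node, which is why the bound is evaluated at $j = h$.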

In  Appendix  B.2,  we  have  shown  that  a  network  of  CJVC  servers  provides  the  same  worst  case 
delay  bounds  as  a  network  of  Jitter-VC  servers.  More  precisely,  we  have  proven  the  following 
property. 


Theorem  2  The  deadline  of  a  packet  at  the  last  hop  in  a  network  of  CJVC  servers  is  equal  to  the 
deadline  of  the  same  packet  in  a  corresponding  network  of  Jitter-VC  servers. 
The example in Figure 5.2 provides some intuition behind the above property. The basic
observation is that, with Jitter-VC, not counting the propagation delay, the difference between the
eligible time of packet $p_i^k$ at node $j$ and its deadline at the previous node, $j - 1$, i.e.,
$e_{i,j}^k - d_{i,j-1}^k$, never decreases as the packet propagates along the path. Consider the second
packet in Figure 5.2. With Jitter-VC, the differences $e_{i,j}^k - d_{i,j-1}^k$ (represented by the bases
of the gray triangles) increase in $j$. By introducing the slack variable $\delta_i^k$, CJVC equalizes
these delays. While this change may increase the delay of the packet at intermediate hops, it does not
affect the end-to-end delay bound.

Figure 5.3 shows the computation of the scheduling parameters $e_{i,j}^k$ and $d_{i,j}^k$ by a CJVC server.
The number of hops $h$ is computed at the admission time as discussed in Section 5.4.1.
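
The algorithms of Figure 5.3 can be exercised with a short Python sketch. It is an illustration under stated assumptions rather than the prototype described later in this chapter: the names `Label`, `IngressFlow`, `core_on_arrival`, `core_on_transmit`, and `simulate` are hypothetical, propagation delays are taken to be zero, and every packet is assumed to leave each node exactly at its eligible time.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """State carried in the packet header (the DPS variables)."""
    r: float      # reserved rate, bits per second
    g: float      # time ahead of schedule at the previous node, seconds
    delta: float  # slack delay, seconds


class IngressFlow:
    """Per-flow state, kept only at the ingress node."""

    def __init__(self, rate, hops):
        if hops < 2:
            raise ValueError("Eq. (5.10) needs a path of at least two nodes")
        self.r, self.h = rate, hops
        self.d = None      # deadline of the previous packet at the ingress
        self.l = None      # length of the previous packet, bits
        self.delta = 0.0   # slack of the previous packet

    def on_arrival(self, now, length):
        if self.d is None:                      # first packet of the flow
            e, self.delta = now, 0.0
        else:
            self.delta = max(0.0, self.delta + (self.l - length) / self.r
                             - max(now - self.d, 0.0) / (self.h - 1))  # Eq. (5.10)
            e = max(now, self.d)
        self.l = length
        self.d = e + length / self.r
        return e, self.d

    def on_transmit(self, now):
        return Label(self.r, self.d - now, self.delta)


def core_on_arrival(label, now, length):
    """Eligible time and deadline from header state alone: Eqs. (5.4), (5.2)."""
    e = now + label.g + label.delta
    return e, e + length / label.r


def core_on_transmit(label, deadline, now):
    return Label(label.r, deadline - now, label.delta)


def simulate(packets, rate, hops):
    """Send (arrival time, length) pairs through `hops` nodes.

    Returns, for each node, the list of (eligible time, deadline) pairs in
    packet order. Toy assumptions: zero propagation delay, and each packet
    is transmitted exactly at its eligible time.
    """
    ingress = IngressFlow(rate, hops)
    trace = [[] for _ in range(hops)]
    for arrival, length in packets:
        e, d = ingress.on_arrival(arrival, length)
        trace[0].append((e, d))
        label, sent = ingress.on_transmit(now=e), e
        for j in range(1, hops):
            e, d = core_on_arrival(label, now=sent, length=length)
            trace[j].append((e, d))
            if j < hops - 1:                    # the egress clears the label
                label = core_on_transmit(label, d, now=e)
            sent = e
    return trace
```

Calling `simulate([(0.0, 1000), (0.5, 500), (3.0, 1000), (3.1, 250)], rate=1000.0, hops=4)` and comparing, at each node, every eligible time with the previous packet's deadline is a quick way to see the invariant $e_{i,j}^k \ge d_{i,j}^{k-1}$ hold without any core node storing the previous deadline.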

5.3.3  Data  Path  Complexity 

While  our  algorithms  do  not  maintain  per  flow  state  at  core  nodes,  there  is  still  the  need  for  core 
nodes to perform regulation and packet scheduling based on eligible times and deadlines. The
natural  question  to  ask  is:  why  is  this  a  more  scalable  scheme  than  previous  solutions  requiring  per 
flow  management? 

There  are  several  scalability  bottlenecks  for  solutions  requiring  per  flow  management.  On  the 
data  path,  the  expensive  operations  are  per  flow  classification  and  scheduling.  On  the  control  path, 
complexity results from the maintenance of consistent and dynamic state in a distributed
environment. Among the three, it is easiest to reduce the complexity of the scheduling algorithm as there
is  a  natural  tradeoff  between  the  complexity  and  the  flexibility  of  the  scheduler  [119].  In  fact,  a 
number  of  techniques  have  already  been  proposed  to  reduce  scheduling  complexity,  including  those 
requiring  constant  time  complexity  [98,  120,  125]. 

We also note that due to the way we regulate traffic, it can be shown that with very high
probability, the number of packets in the server at any given time is significantly smaller than the number
of  flows.  This  will  further  reduce  the  scheduling  complexity  and  in  addition  reduce  the  buffer  space 
requirement.  More  precisely,  in  Appendix  C  we  prove  the  following  result. 

Theorem 3 Consider a server traversed by $n$ flows. Assume that the arrival times of the packets
from different flows are independent, and that all packets have the same size. Then, for any given
probability $\varepsilon$, the queue size at any instant during a server busy period is asymptotically
bounded above by $s$, where

$$s = \sqrt{\beta n \left(\frac{\ln n}{2} - \frac{\ln \varepsilon}{2} - 1\right)} \tag{5.11}$$

with a probability larger than $1 - \varepsilon$. For identical reservations $\beta = 1$; for heterogeneous
reservations $\beta = 3$.

As an example, let $n = 10^6$, and $\varepsilon = 10^{-10}$, which is the same order of magnitude as the
probability of a packet being corrupted at the physical layer. Then, by Eq. (5.11) we obtain $s = 4174$
if all flows have identical reservations, and $s = 7230$ if flows have heterogeneous reservations. Thus
the probability of having more packets in the queue than specified by Eq. (5.11) can be neglected at
the level of the entire system even in the context of guaranteed services.
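
The bound is easy to evaluate numerically. The sketch below reads Eq. (5.11) as $s = \sqrt{\beta n ((\ln n - \ln \varepsilon)/2 - 1)}$; the function name `queue_bound` is hypothetical, and rounding the result up to a whole number of packets is an assumption of this sketch, chosen because it matches the figures quoted in the example above.

```python
import math


def queue_bound(n, eps, beta=1):
    """Queue-size bound s of Eq. (5.11), rounded up to whole packets.

    n:    number of flows traversing the server
    eps:  tolerated probability that the bound is exceeded at a given instant
    beta: 1 for identical reservations, 3 for heterogeneous reservations
    """
    s_squared = beta * n * ((math.log(n) - math.log(eps)) / 2 - 1)
    return math.ceil(math.sqrt(s_squared))  # rounding up is an assumption


if __name__ == "__main__":
    print(queue_bound(10**6, 1e-10, beta=1))  # 4174
    print(queue_bound(10**6, 1e-10, beta=3))  # 7230
```

Because $s$ grows roughly as $\sqrt{n \ln n}$, multiplying the number of flows by 100 increases the bound by little more than a factor of 10, which is the effect visible in Table 5.2.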
(a) Identical reservations

# flows (n)     bound (s)     max. queue size
100             31            28
1,000           109           100
10,000          374           284
100,000         1276          880
1,000,000       4310          2900

(b) Heterogeneous reservations

# flows (n)     bound (s)     max. queue size
100             53            30
1,000           188           95
10,000          648           309
100,000         2210          904
1,000,000       7465          2944

Table 5.2: The upper bound of the queue size, $s$, computed by Eq. (5.11) for $\varepsilon = 10^{-5}/n$
(where $n$ is the number of flows) versus the maximum queue size achieved during the first $n$ time
slots of a busy period over $10^5$ independent trials: (a) when all flows have identical reservations;
(b) when the flows' reservations differ by a factor of 20.


In Table 5.2 we compare the bounds given by Eq. (5.11) to simulation results. In each case
we report the maximum queue size achieved during the first $n$ time slots of a busy period over
$10^5$ independent trials. We note that in the case of all flows having identical reservations we are
guaranteed that if the queue does not overflow during the first $n$ time slots of a busy period, it will
not overflow during the rest of the busy period (see Corollary 1). Since the probability that a buffer
will overflow during the first $n$ time slots is no larger than $n$ times the probability of the buffer
overflowing during an arbitrary time slot, we use $\varepsilon = 10^{-5}/n$ to compute the corresponding
bounds.^1

The results show that our bounds are reasonably close (within a factor of two) when all
reservations are identical, but are more conservative when the reservations are different. Finally, we
make three comments. First, by performing per packet regulation at every core node, the bounds given
by Eq. (5.11) hold for any core node and are independent of the path length. Second, if the flows’
arrival patterns are not independent, we can easily enforce independence by randomly delaying the
first packet from each backlogged period of the flow at ingress nodes. This will increase the
end-to-end packet delay by at most the queueing delay of one extra hop. Third, the bounds given by
Eq. (5.11) are asymptotic. In particular, in proving the results in Appendix C we make the assumption
that $n \gg s$. However, this is a reasonable assumption in practice, as the most interesting cases
involve high values for $n$, and, as suggested by Eq. (5.11) and the results in Table 5.2, even for
small values of $\varepsilon$ (e.g., $10^{-10}$), $n$ is much larger than $s$.

5.4  Control  Plane:  Admission  Control  With  No  Per  Flow  State 

A  key  component  of  any  architecture  that  provides  guaranteed  services  is  the  admission  control.  The 
main  job  of  the  admission  control  is  to  ensure  that  the  network  resources  are  not  over-committed. 
In particular it has to ensure that the sum of the reservation rates of all flows that traverse any link in
the network is no larger than the link capacity, i.e., $\sum_i r_i \le C$. A new reservation request is
granted if it passes the admission test at each hop along its path. As discussed in this chapter’s
introduction, implementing such a functionality is not trivial: traditional distributed architectures
based on signaling protocols are not scalable and are less robust due to the requirement of maintaining
dynamic and replicated state; centralized architectures have scalability and availability concerns.

^1 More formally, let $\varepsilon'$ be an upper bound on the probability that the buffer overflows
during the first $n$ time slots of the busy period. Then by taking $\varepsilon' = n\varepsilon$,
Eq. (5.11) becomes $s = \sqrt{\beta n (\ln n - (\ln \varepsilon')/2 - 1)}$.

Figure 5.4: Ingress-egress admission control when RSVP is used outside the SCORE domain.

In  this  section,  we  propose  a  fully  distributed  architecture  for  implementing  admission  control. 
Like  most  distributed  admission  control  architectures,  in  our  solution,  each  node  keeps  track  of 
the  aggregate  reservation  rate  for  each  of  its  out-going  links  and  makes  local  admission  control 
decisions.  However,  unlike  existing  reservation  protocols,  this  distributed  admission  control  process 
is  achieved  without  core  nodes  maintaining  per  flow  state. 

5.4.1  Ingress-to-Egress  Admission  Control 

We  consider  an  architecture  in  which  a  lightweight  signaling  protocol  is  used  within  the  SCORE 
domain.  Edge  routers  are  the  interface  between  this  signaling  protocol  and  an  inter-domain  signaling 
protocol such as RSVP. For the purpose of this discussion, we consider only unicast reservations.
In  addition,  we  assume  a  mechanism  like  the  one  proposed  by  Stoica  and  Zhang  [103]  or  Multi- 
Protocol  Label  Switching  (MPLS)  [17]  that  can  be  used  to  pin  a  flow  to  a  route. 

From  the  point  of  view  of  RSVP,  a  path  through  the  SCORE  domain  is  just  a  virtual  link.  There 
are two basic control messages in RSVP: Path and Resv. These messages are processed only by edge
nodes; no operations are performed inside the domain. For the ingress node, upon receiving a Path
message, it simply forwards it through the domain. For the egress node, upon receiving the first Resv
message for a flow (i.e., there was no RSVP state for the flow at the egress node before receiving
the message), it will forward the message (message “1” in Figure 5.4) to the corresponding ingress
node, which in turn will send a special signaling message (message “2” in Figure 5.4) along the path
toward the egress node. Upon receiving the signaling message, each node along the path performs a
local admission control test as described in Section 5.4.2. In addition, the message carries a counter,
$h$, that is incremented at each hop. The final value $h$ is used for computing the slack delay,
$\delta$ (see Eq. (5.10)). If we use the route pinning mechanism described in Chapter 6, message “2” is also
used  to  compute  the  label  of  the  path  between  the  ingress  and  egress.  This  label  is  used  then  by  the 
ingress  node  to  make  sure  that  all  data  packets  of  the  flow  are  forwarded  along  the  same  path.  When 
the  signaling  message  “2”  reaches  the  egress  node,  it  is  reflected  back  to  the  sender,  which  makes 
the  final  decision  (message  “3”  in  Figure  5.4).  RSVP  refresh  messages  for  a  flow  that  already  has 
per  flow  RSVP  state  installed  at  edge  routers  will  not  trigger  additional  signaling  messages  inside 
the  domain. 
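
The handling of signaling message “2” can be sketched as follows. This is a simplified illustration, not the protocol implementation: the names `Node`, `SignalingMessage`, and `request_reservation` are hypothetical, the counter is assumed to start at zero and to be incremented by every node on the path including the ingress, and nodes downstream of a failed test are assumed to skip their own test while still forwarding the message.

```python
from dataclasses import dataclass


@dataclass
class Node:
    capacity: float          # link capacity C, bits per second
    reserved: float = 0.0    # aggregate reserved rate on the outgoing link

    def admit(self, rate):
        """Local admission test R + r <= C; reserve the rate on success."""
        if self.reserved + rate <= self.capacity:
            self.reserved += rate
            return True
        return False


@dataclass
class SignalingMessage:
    rate: float              # requested reservation rate r
    hops: int = 0            # counter h, incremented at each node
    admitted: bool = True    # cleared by the first node whose test fails


def request_reservation(path, rate):
    """Forward the signaling message along `path`.

    Returns (decision, h): the outcome that the ingress sees once the
    message is reflected back, and the final value of the hop counter.
    """
    msg = SignalingMessage(rate)
    for node in path:
        msg.hops += 1
        if msg.admitted and not node.admit(rate):
            msg.admitted = False
    return msg.admitted, msg.hops
```

In this sketch a rejected request leaves the rate reserved at the nodes upstream of the failure. Such partial reservations are one of the robustness problems that the per-hop algorithm of Section 5.4.2 has to cope with.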

Since  RSVP  uses  raw  IP  or  UDP  to  send  control  messages,  there  is  no  need  for  retransmission 
Notation          Comments

$r_i$             flow $i$'s reserved rate
$b_i^k$           total number of bits flow $i$ is entitled to transmit
                  during $[s_i^k, s_i^{k+1}]$, i.e., $b_i^k = r_i (s_i^{k+1} - s_i^k)$
$R(t)$            aggregate reservation at time $t$
$R_{bound}(t)$    upper bound of $R(t)$, used by admission test
$R_{DPS}(t)$      estimate of $R(t)$, computed by using DPS
$R_{new}(t)$      sum of all new reservations accepted from the
                  beginning of current estimation interval until $t$
$R_{cal}(t)$      upper bound of $R(t)$, used to calibrate $R_{bound}$,
                  computed based on $R_{DPS}$ and $R_{new}$

Table 5.3: Notations used in Section 5.4.3.

for  our  signaling  messages,  as  message  loss  will  not  break  the  RSVP  semantics.  If  the  sender  does 
not  receive  a  reply  after  a  certain  timeout,  it  simply  drops  the  Resv  message.  In  addition,  as  we  will 
show in Section 5.4.3, there is no need for a special termination message inside the domain when a
flow is torn down.

5.4.2  Per-Hop  Admission  Control 

Each node needs to ensure that $\sum_i r_i \le C$ holds at all times. At first sight, one simple solution
that implements this test and also avoids per flow state is for each node to maintain the aggregate
reserved rate, $R$, where $R$ is updated to $R = R + r$ when a new flow with the reservation rate $r$ is
admitted, and to $R = R - r'$ when a flow with the reservation rate $r'$ terminates. The admission
control then reduces to checking whether $R + r \le C$ holds. However, it can be easily shown
that such a simple solution is not robust with respect to various failure conditions such as packet
loss, partial reservation failures, and network node crashes. To handle packet loss, when a node
receives a set-up or tear-down message, the node has to be able to tell whether it is a duplicate of a
message already processed. To handle partial reservation failures, a node needs to “remember” what
decision it made for the flow in a previous pass. That is why all existing solutions maintain per flow
reservation state, be it hard state as in ATM UNI or soft state as in RSVP. However, maintaining
consistent and dynamic state in a distributed environment is itself challenging. Fundamentally, this
is because the update operations assume a transaction semantic, which is difficult to implement in
a distributed environment [4, 117].

In the remainder of this section, we show that by using DPS, it is possible to significantly reduce
the complexity of admission control in a distributed environment. Before we present the details of
the algorithm, we point out that our goal is to estimate a close upper bound on the aggregate reserved
rate. By using this bound in the admission test we avoid over-provisioning, which is a necessary
condition to provide deterministic service guarantees. This is in contrast to many measurement-based
admission control algorithms [62, 110], which, in the context of supporting controlled load
or statistical services, base their admission test on the measurement of the actual amount of traffic
transmitted. To achieve this goal, our algorithm uses two techniques. First, a conservative upper
bound of $R$, denoted $R_{bound}$, is maintained at each core node and is used for making admission
control decisions. $R_{bound}$ is updated with a simple rule: $R_{bound} = R_{bound} + r$ whenever a new
request of a rate $r$ is accepted. It should be noted that in order to maintain the invariant that $R_{bound}$
is an upper bound of $R$, this algorithm does not need to detect duplicate request messages, generated
either due to retransmission in case of packet loss or retry in case of partial reservation failures. Of
course, the obvious problem with this algorithm is that $R_{bound}$ will diverge from $R$. In the limit,
when $R_{bound}$ reaches the link capacity $C$, no new requests can be accepted even though there might
be available capacity.
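
The contrast between the naive counter and the add-only bound can be made concrete with a small Python sketch. The function names and the message trace below are hypothetical and chosen only for illustration; the point is that a retransmitted tear-down pushes the naive counter below the true aggregate, while the add-only rule stays above it and drifts upward.

```python
def naive_aggregate(events):
    """Track R with R += r on set-up and R -= r on tear-down (no duplicate detection)."""
    R = 0.0
    for kind, rate in events:
        if kind == "setup":
            R += rate
        elif kind == "teardown":
            R -= rate
    return R


def upper_bound(events):
    """Track R_bound with the add-only rule: R_bound += r on every accepted request."""
    R_bound = 0.0
    for kind, rate in events:
        if kind == "setup":
            R_bound += rate
    return R_bound


# Hypothetical trace: a flow of rate 2 sets up and tears down, and its tear-down
# message is retransmitted once; a second flow of rate 3 stays active.
events = [("setup", 2.0), ("setup", 3.0), ("teardown", 2.0), ("teardown", 2.0)]
true_R = 3.0  # only the rate-3 flow still holds a reservation

print(naive_aggregate(events))  # 1.0: below the true value, so over-admission is possible
print(upper_bound(events))      # 5.0: never below the true value, but it drifts upward
```

The drift visible in the second number is exactly what the recalibration step described next is meant to remove.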

To address this problem, a separate algorithm is introduced to periodically estimate the aggregate
reserved rate. Based on this estimate, a second upper bound for $R$, denoted $R_{cal}$, is computed and
used to recalibrate $R_{bound}$. An important aspect of the estimation algorithm is that the discrepancy
between the upper bound $R_{cal}$ and the actual reserved rate, $R$, can be bounded. The recalibration
then amounts to taking the minimum of the two upper bounds $R_{bound}$ and $R_{cal}$. The
estimation algorithm is based on DPS and does not require core routers to maintain per flow state.

Our algorithms have several important properties. First, they are robust in the presence of network
losses and partial reservation failures. Second, while they can over-estimate $R$, they will never
underestimate $R$. This ensures the semantics of the guaranteed service: while over-estimation can
lead to under-utilization of network resources, under-estimation can result in over-provisioning and
violation of performance guarantees. Finally, the proposed estimation algorithms are self-correcting
in the sense that over-estimation in a previous period will be corrected in the next period. This
greatly reduces the possibility of serious resource under-utilization.

5.4.3  Aggregate  Reservation  Estimation  Algorithm 

In this section, we present the estimation algorithm of the aggregate reserved rate which is performed
at each core node. In particular, we will describe how $R_{cal}$ is computed and how it is used to
recalibrate $R_{bound}$. In designing the algorithm for computing $R_{cal}$, we want to balance between two
goals: (a) $R_{cal}$ should be an upper bound on $R$; (b) over-estimation errors should be corrected and
kept to the minimum.

To compute $R_{cal}$ we start with an inaccurate estimate of $R$, denoted $R_{DPS}$, and then make
adjustments to account for estimation inaccuracies. In the following, we first present the algorithm
that computes $R_{DPS}$, then describe the possible inaccuracies and the corresponding adjustment
algorithms.

The estimate $R_{DPS}$ is calculated using the DPS technique: ingress nodes insert additional state
in packet headers, state which is in turn used by core nodes to estimate the aggregate reservation $R$.
In particular, a new state, $b_i^k$, is inserted in the header of packet $p_i^k$:

$$ b_i^k = r_i \, (s_i^k - s_i^{k-1}), \tag{5.12} $$

where $s_i^{k-1}$ and $s_i^k$ are the times the packets $p_i^{k-1}$ and $p_i^k$ are transmitted by the ingress node.
Therefore, $b_i^k$ represents the total amount of bits that flow $i$ is entitled to send during the interval
$[s_i^{k-1}, s_i^k]$. The computation of $R_{DPS}$ is based on the following simple observation: the sum of
$b$ values of all packets of flow $i$ during an interval is a good approximation for the total number
of bits that flow $i$ is entitled to send during that interval according to its reserved rate. Similarly,
the sum of $b$ values of all packets is a good approximation for the total number of bits that all
flows are entitled to send during the corresponding interval. Dividing this sum by the length of
the interval gives the aggregate reservation rate. More precisely, let us divide time into intervals of
length $T_W$: $(u_k, u_{k+1}]$, $k > 0$. Let $b_i(u_k, u_{k+1})$ be the sum of $b$ values of packets in flow $i$ received
during $(u_k, u_{k+1}]$, and let $B(u_k, u_{k+1})$ be the sum of $b$ values of all packets during $(u_k, u_{k+1}]$. The
estimate is then computed at the end of each interval $(u_k, u_{k+1}]$ as follows

$$ R_{DPS}(u_{k+1}) = \frac{B(u_k, u_{k+1})}{T_W}. \tag{5.13} $$

Figure 5.5: The scenario in which the lower bound of $b_i$, i.e., $r_i (T_W - T_I - T_J)$, is achieved. The arrows
represent packet transmissions. $T_W$ is the averaging window size; $T_I$ is an upper bound on the packet
inter-departure time; $T_J$ is an upper bound on the delay jitter. Both m1 and m2 miss the estimation interval
$T_W$.
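
The two steps above, stamping $b$ values at the ingress (Eq. (5.12)) and averaging them at a core node (Eq. (5.13)), can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the helper names, the two flows, and their send times are made up, and a packet is assumed to reach the core node at the instant it is sent (zero delay and zero jitter).

```python
def b_values(rate, send_times):
    """Per-packet state of Eq. (5.12): b_k = rate * (s_k - s_{k-1}).

    send_times[0] is treated as the departure time of the previous packet,
    so the first returned (time, b) pair belongs to send_times[1].
    """
    return [(s, rate * (s - prev)) for prev, s in zip(send_times, send_times[1:])]


def estimate_r_dps(packets, u_k, u_k1):
    """Eq. (5.13): sum the b values that arrive in (u_k, u_k1] and divide by T_W."""
    total_b = sum(b for arrival, b in packets if u_k < arrival <= u_k1)
    return total_b / (u_k1 - u_k)


# Two hypothetical flows with reservations of 0.5 Mbps and 1.5 Mbps.
# Packet spacing is irregular on purpose.
flow_a = b_values(0.5e6, [0.0, 0.4, 1.0, 1.1, 2.5, 3.0, 4.2, 5.0])
flow_b = b_values(1.5e6, [0.0, 1.0, 2.0, 2.2, 3.9, 5.0])

r_dps = estimate_r_dps(flow_a + flow_b, u_k=0.0, u_k1=5.0)
print(r_dps)  # close to 2.0e6, the sum of the two reservations
```

Because the $b$ values of one flow telescope over the interval, the estimate depends on the reserved rates and not on how bursty the actual packet spacing is.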

While simple, the above algorithm may introduce two types of inaccuracies. First, it ignores
the effects of the delay jitter and the packet inter-departure times. Second, it does not consider the
effects of accepting or terminating a reservation in the middle of an estimation interval. In particular,
having newly accepted flows in the interval may result in the under-estimation of $R(t)$ by $R_{DPS}(t)$.
To illustrate this, consider the following simple example: there are no guaranteed flows on a link
until a new request with rate $r$ is accepted at the end of an estimation interval $(u_k, u_{k+1}]$. If no data
packet from the new flow reaches the node before $u_{k+1}$, $B(u_k, u_{k+1})$ would be 0, and so would be
$R_{DPS}(u_{k+1})$. However, the correct value should be $r$.

In the following, we present the algorithm to compute an upper bound of $R(u_{k+1})$, denoted
$R_{cal}(u_{k+1})$. In doing this we account for both types of inaccuracies. Let $\mathcal{L}(t)$ denote the set
of reservations at time $t$. Our goal is then to bound the aggregate reservation at time $u_{k+1}$, i.e.,
$R(u_{k+1}) = \sum_{i \in \mathcal{L}(u_{k+1})} r_i$. Consider the division of $\mathcal{L}(u_{k+1})$ into two subsets: the subset of new
reservations that were accepted during the interval $(u_k, u_{k+1}]$, denoted $\mathcal{N}(u_{k+1})$, and the subset
containing the rest of reservations which were accepted no later than $u_k$. Next, we express
$R(u_{k+1})$ as

$$ R(u_{k+1}) = \sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})} r_i + \sum_{i \in \mathcal{N}(u_{k+1})} r_i. \tag{5.14} $$

The idea is then to derive an upper bound for each of the two right-hand side terms, and compute
$R_{cal}$ as the sum of these two bounds. To bound $\sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})} r_i$, we note that

$$ B(u_k, u_{k+1}) \ge \sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})} b_i(u_k, u_{k+1}). \tag{5.15} $$

The reason that (5.15) is an inequality instead of an equality is that when there are flows terminating
during the interval $(u_k, u_{k+1}]$, their packets may still have contributed to $B(u_k, u_{k+1})$ even though
they do not belong to $\mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})$. Next, we compute a lower bound for $b_i(u_k, u_{k+1})$. By
definition, since $i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})$, it follows that flow $i$ holds a reservation during the entire
interval $(u_k, u_{k+1}]$. Let $T_I$ be the maximum inter-departure time between two consecutive packets
of a flow at the edge node, and let $T_J$ be the maximum delay jitter of a flow. $T_J$ represents the
maximum difference between the delays experienced by two packets between ingress and egress
nodes. The ingress-egress delay of a packet represents the difference between the arrival time of the
packet at the egress node and the departure time of the packet at the ingress node. In the remainder
of this section, we assume that $T_W$ is chosen such that both $T_I$ and $T_J$ are much smaller than $T_W$.
Now, consider the scenario shown in Figure 5.5 in which a core node receives the packets m1 and
m2 just outside the estimation window. Assuming the worst case in which m1 incurs the lowest
possible delay, m2 incurs the maximum possible delay, and that the last packet before m2 departs
$T_I$ seconds earlier, it is easy to see that the sum of the $b$ values carried by the packets received
during the estimation interval by the core node cannot be smaller than $r_i (T_W - T_I - T_J)$. Thus, we
have

$$ b_i(u_k, u_{k+1}) \ge r_i \, (T_W - T_I - T_J), \quad \forall i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1}). \tag{5.16} $$

By combining Eqs. (5.15) and (5.16), and Eq. (5.13) we obtain

$$ \sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})} r_i \le \sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})} \frac{b_i(u_k, u_{k+1})}{T_W - T_I - T_J} \tag{5.17} $$

$$ \le \frac{R_{DPS}(u_{k+1})}{1 - f}, \tag{5.18} $$

where $f = (T_I + T_J)/T_W$.
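
For completeness, the chain of inequalities behind Eq. (5.18) can be written out step by step. The following is a sketch that uses only Eqs. (5.13), (5.15), and (5.16) and the definition of $f$, and it assumes $T_W > T_I + T_J$ so that dividing by $T_W - T_I - T_J$ preserves the direction of the inequalities:

```latex
\begin{align*}
\sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})} r_i
  &\le \sum_{i \in \mathcal{L}(u_{k+1}) \setminus \mathcal{N}(u_{k+1})}
       \frac{b_i(u_k, u_{k+1})}{T_W - T_I - T_J}
  && \text{by (5.16)} \\
  &\le \frac{B(u_k, u_{k+1})}{T_W - T_I - T_J}
  && \text{by (5.15)} \\
  &= \frac{T_W \, R_{DPS}(u_{k+1})}{T_W \, (1 - f)}
  && \text{by (5.13) and } f = (T_I + T_J)/T_W \\
  &= \frac{R_{DPS}(u_{k+1})}{1 - f}.
\end{align*}
```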

Next, we bound the second right-hand side term in Eq. (5.14): $\sum_{i \in \mathcal{N}(u_{k+1})} r_i$. For this, we
introduce a new global variable $R_{new}$. $R_{new}$ is initialized at the beginning of each interval $(u_k, u_{k+1}]$ to
zero, and is updated to $R_{new} + r$ every time a new reservation, $r$, is accepted. Let $R_{new}(t)$ denote
the value of this variable at time $t$. For simplicity, here we assume that a flow which is granted a
reservation during the interval $(u_k, u_{k+1}]$ becomes active no later than $u_{k+1}$^. Then it is easy to see
that

$$ \sum_{i \in \mathcal{N}(u_{k+1})} r_i \le R_{new}(u_{k+1}). \tag{5.19} $$

^Otherwise, to account for the case in which a reservation accepted during the interval $(u_{k-1}, u_k]$ becomes active
after $u_k + RTT$, we need to subtract $RTT \times R_{new}(u_k)$ from $B(u_k, u_{k+1})$.

Per-hop Admission Control

    on reservation request r
        if (R_bound + r <= C)       /* perform admission test */
            R_new = R_new + r;
            R_bound = R_bound + r;
            accept request;
        else
            deny request;

    on reservation termination r    /* optional */
        R_bound = R_bound - r;

Aggregate Reservation Bound Comp.

    on packet arrival p
        b = get_b(p);               /* get b value inserted by ingress (Eq. (5.12)) */
        L = L + b;
    on time-out T_W
        R_DPS = L / T_W;            /* estimate aggregate reservation */
        R_bound = min(R_bound, R_DPS/(1 - f) + R_new);
        R_new = 0;

Figure 5.6: The control path algorithms executed by core nodes; $R_{new}$ is initialized to 0.
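
As a complement to the pseudocode in Figure 5.6, the following Python sketch shows one way the core-node control path could be organized. It is not the dissertation's implementation: the class and method names are hypothetical, and the sketch assumes that the running sum of $b$ values (L in the figure) restarts from zero in every interval, which the figure does not show explicitly but which matches the statement that the estimate uses only the $b$ values of the current interval.

```python
class CoreNodeAdmission:
    """Control-path state of one core node: no per-flow entries, only aggregates."""

    def __init__(self, capacity, t_w, f):
        self.capacity = capacity  # link capacity C
        self.t_w = t_w            # estimation interval T_W
        self.f = f                # f = (T_I + T_J) / T_W, must be < 1
        self.r_bound = 0.0        # upper bound used by the admission test
        self.r_new = 0.0          # reservations accepted in the current interval
        self.b_sum = 0.0          # running sum L of b values in the current interval

    def on_request(self, r):
        """Admission test: accept only if the bound stays within capacity."""
        if self.r_bound + r <= self.capacity:
            self.r_new += r
            self.r_bound += r
            return True
        return False

    def on_termination(self, r):
        """Optional tear-down; must be delivered at most once per flow."""
        self.r_bound -= r

    def on_packet(self, b):
        """Accumulate the b value carried in a packet header (Eq. (5.12))."""
        self.b_sum += b

    def on_timeout(self):
        """End of an interval: estimate, recalibrate, and start a new interval."""
        r_dps = self.b_sum / self.t_w
        r_cal = r_dps / (1.0 - self.f) + self.r_new
        self.r_bound = min(self.r_bound, r_cal)
        self.r_new = 0.0
        self.b_sum = 0.0  # assumption of this sketch: L restarts in every interval
        return r_dps, r_cal


node = CoreNodeAdmission(capacity=10.0, t_w=5.0, f=0.1)
node.on_request(4.0)
node.on_request(4.0)        # duplicate of the same request: bound is 8, true R is 4
for _ in range(5):          # interval 1: one packet per second, each with b = 4
    node.on_packet(4.0)
node.on_timeout()           # R_cal = 4/0.9 + 8, so the bound stays at 8
for _ in range(5):          # interval 2: same traffic, no new requests
    node.on_packet(4.0)
r_dps, r_cal = node.on_timeout()
print(r_dps, r_cal, node.r_bound)  # the bound drops to about 4.44
```

In this toy run the duplicate request inflates the bound to 8 for one interval, after which recalibration pulls it back to $R_{DPS}/(1 - f)$, which is still above the true aggregate of 4.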

Equality holds in Eq. (5.19) when no duplicate reservation requests are processed, and none of the newly
accepted reservations terminate during the interval. Then we define $R_{cal}(u_{k+1})$ as

$$ R_{cal}(u_{k+1}) = \frac{R_{DPS}(u_{k+1})}{1 - f} + R_{new}(u_{k+1}). \tag{5.20} $$

From Eq. (5.14), and Eqs. (5.18) and (5.19), it follows easily that $R_{cal}(u_{k+1})$ is an upper bound for
$R(u_{k+1})$, i.e., $R_{cal}(u_{k+1}) \ge R(u_{k+1})$. Finally, we use $R_{cal}(u_{k+1})$ to recalibrate the upper bound
of the aggregate reservation, $R_{bound}$, at $u_{k+1}$ as

$$ R_{bound}(u_{k+1}) = \min(R_{bound}(u_{k+1}), R_{cal}(u_{k+1})). \tag{5.21} $$

Figure 5.6 shows the pseudocode of the control algorithms at core nodes. Next we make several
observations.

First, the estimation algorithm uses only the information in the current interval. This makes
the algorithm robust with respect to loss and duplication of signaling packets since their effects
are “forgotten” after one time interval. As an example, if a node processes both the original and
a duplicate of the same reservation request during the interval $(u_k, u_{k+1}]$, $R_{bound}$ will be updated
twice for the same flow. However, this erroneous update will not be reflected in the computation of
$R_{DPS}(u_{k+2})$, since its computation is based only on the $b$ values received during $(u_{k+1}, u_{k+2}]$.

Figure 5.7: The test configuration used in experiments. (Diagram labels: Host 1, Host 2, Host 3, Host 4.)

As a consequence, an important property of our admission control algorithm is that it can
asymptotically reach a link utilization of $C(1 - f)/(1 + f)$. In particular, the following property is proven
in Appendix B.4:

Theorem 4 Consider a link of capacity $C$ at time $t$. Assume that no reservation terminates and
there are no reservation failures or request losses after time $t$. Then if there is sufficient demand
after $t$ the link utilization approaches asymptotically $C(1 - f)/(1 + f)$.
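
To get a feel for the bound in Theorem 4, it helps to plug in a number. Taking $f = 0.1$ purely as an illustrative value (for example, $T_I + T_J = 500$ ms with $T_W = 5$ s), the asymptotic utilization works out to roughly 82% of the link capacity:

```latex
\frac{C\,(1 - f)}{1 + f} \;=\; \frac{0.9}{1.1}\,C \;\approx\; 0.818\,C
\qquad \text{for } f = 0.1 .
```

A smaller $f$, obtained by lengthening $T_W$ relative to $T_I + T_J$, moves this value closer to $C$ at the cost of slower recalibration.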

Second, note that since $R_{cal}(u_k)$ is an upper bound of $R(u_k)$, a simple solution would be to use
$R_{cal}(u_k) + R_{new}$, instead of $R_{bound}$, to perform the admission test during $(u_k, u_{k+1}]$. The problem
with this approach is that $R_{cal}$ can overestimate the aggregate reservation $R$. An example is given
in Section 5.5 to illustrate this issue (Figure 5.10(b)).

Third, we note that a possible optimization of the admission control algorithm is to add reservation
termination messages (see Figure 5.6). This will reduce the discrepancy between the upper
bound, $R_{bound}$, and the aggregate reservation $R$. However, in order to guarantee that $R_{bound}$
remains an upper bound for $R$, we need to ensure that a termination message is sent at most once,
i.e., there are no retransmissions if the message is lost. In practice, this property can be enforced by
edge nodes, which maintain per flow state.

Finally, to ensure that the maximum inter-departure time is no larger than $T_I$, the ingress node
may need to send a dummy packet in the case when no data packet arrives for a flow during an
interval $T_I$. This can be achieved by having the ingress node maintain a timer with each flow. An
optimization would be to aggregate all “micro-flows” between each pair of ingress and egress nodes
into one flow, compute $b$ values based on the aggregated reservation rate, and insert a dummy packet
only if there is no data packet of the aggregate flow during an interval.
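
The per-flow timer at the ingress can be illustrated with a short Python sketch. The function name, the offline list-based interface, and the example times are assumptions made for illustration only; a real ingress node would arm a timer per flow rather than post-process a list of departure times.

```python
def departures_with_dummies(data_times, t_i, horizon):
    """Return (time, kind) departures so that consecutive gaps never exceed t_i.

    data_times: departure times of the real data packets of one flow.
    t_i:        maximum allowed inter-departure time T_I.
    horizon:    time until which the flow holds its reservation.
    """
    out = []
    last = 0.0  # assumption: the flow becomes active at time 0
    for t in sorted(data_times):
        # The timer fires every t_i seconds while no data packet leaves.
        while t - last > t_i:
            last += t_i
            out.append((last, "dummy"))
        out.append((t, "data"))
        last = t
    # Keep the timer running until the reservation ends.
    while horizon - last > t_i:
        last += t_i
        out.append((last, "dummy"))
    return out


schedule = departures_with_dummies([0.2, 0.3, 1.6, 1.9], t_i=0.5, horizon=3.0)
for when, kind in schedule:
    print(f"{when:.1f} {kind}")
```

With these inputs, two dummy packets fill the silent period between 0.3 s and 1.6 s and two more follow the last data packet, so a core node never sees a gap longer than $T_I$ from this flow.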


5.5  Experimental  Results 

We  have  fully  implemented  the  algorithms  described  in  this  chapter  in  FreeBSD  v2.2.6  and  deployed 
them  in  a  testbed  consisting  of  266  MHz  and  300  MHz  Pentium  II  PCs  connected  by  point-to-point 
100  Mbps  Ethernets.  The  testbed  allows  the  configuration  of  a  path  with  up  to  two  core  routers.  The 
details  of  the  implementations  and  of  the  state  encoding  are  presented  in  Chapter  8. 

In the remainder of this section, we present results from four simple experiments. The
experiments are designed to illustrate the microscopic behaviors of the algorithms, rather than their
scalability. All experiments were run on the topology shown in Figure 5.7. The first router is
configured as an ingress node, while the second router is configured as an egress node. An egress node
also implements the functionalities of a core node. In addition, it restores the initial values of the
field. All traffic is UDP and all packets are 1000 bytes, not including the header.

In the first experiment we consider a flow between hosts 1 and 3 that has a reservation of 10
Mbps but sends at a much higher rate of about 30 Mbps. Figures 5.8(a) and (b) plot the arrival and
departure times for the first 30 packets of the flow at the ingress and egress node, respectively. One
thing to notice in Figure 5.8(a) is that the arrival rate at the ingress node is almost three times the
departure rate, which is the same as the reserved rate of 10 Mbps. This illustrates the non-work-conserving
nature of the CJVC algorithm, which enforces the traffic profile and allows only 10
Mbps traffic into the network. Another thing to notice is that all packets incur about 0.8 ms delay in
the egress node. This is because they are sent by the ingress node as soon as they become eligible,
and therefore leave ahead of their deadlines by $l/r = 8 \times 1052$ bits / 10 Mbps $\approx 0.84$ ms. As a result,
they will be held in the rate-controller for this amount of time at the next hop^, which is the egress
node in our case.

Figure 5.8: Packet arrival and departure times for a 10 Mbps flow at (a) the ingress node, and (b) the egress
node.

In the second experiment we consider three guaranteed flows between hosts 1 and 3 with
reservations of 10 Mbps, 20 Mbps, and 40 Mbps, respectively. In addition, we consider a fourth UDP
flow between hosts 2 and 4 which is treated as best effort. The arrival rates of the first three flows
are slightly larger than their reservations, while the arrival rate of the fourth flow is approximately
60 Mbps. At time 0, only the best-effort flow is active. At time 2.8 ms, the first three flows
become simultaneously active. Flows 1 and 2 terminate after sending 12 and 35 packets, respectively.
Figure 5.9 shows the packet arrival and departure times for the best-effort flow 4, and the packet
departure times for the real-time flows 1, 2, and 3. As can be seen, the best-effort packets experience
very low delay in the initial period of 2.8 ms. After the guaranteed flows become active, best-effort
packets experience longer delays while guaranteed flows receive service at their reserved rate. After
flows 1 and 2 terminate, the best-effort traffic grabs the remaining bandwidth.

The last two experiments illustrate the algorithms for admission control described in Section 5.4.3.
The first experiment demonstrates the accuracy of estimating the aggregate reservation based on the
$b$ values carried in the packet headers. The second experiment illustrates the computation of the
aggregate reservation bound, $R_{bound}$, when a new reservation is accepted or a reservation terminates.
In these experiments we use an averaging interval, $T_W$, of 5 seconds, and a maximum inter-departure
time, $T_I$, of 500 ms. Because all packets have the same size, the ingress to egress delays experienced
by any two packets of the same flow are practically the same. As a result, we neglect the delay jitter,
i.e., we assume $T_J = 0$. This gives us $f = (T_I + T_J)/T_W = 0.1$.

^Note that since all packets have the same size, $\delta = 0$.

Figure 5.9: The packets' arrival and departure times for four flows. The first three flows are guaranteed, with
reservations of 10 Mbps, 20 Mbps, and 40 Mbps. The last flow is best effort with an arrival rate of about 60
Mbps.


In the first experiment we consider two flows, one with a reservation of 0.5 Mbps, and the other
with a reservation of 1.5 Mbps. Figure 5.10(a) plots the arrival rate of each flow, as well as the
arrival rate of the aggregate traffic. In addition, Figure 5.10(a) plots the bound of the aggregate
reservation used by the admission test, $R_{bound}$, the estimate of the aggregate reservation $R_{DPS}$, and the
bound $R_{cal}$ used to recalibrate $R_{bound}$. According to the pseudocode in Figure 5.6, both $R_{DPS}$ and
$R_{cal}$ are updated at the end of each estimation interval. More precisely, every 5 seconds $R_{DPS}$ is
computed based on the $b$ values carried in the packet headers, while $R_{cal}$ is computed as
$R_{DPS}/(1 - f) + R_{new}$. Note that since, in this case, no new reservation is accepted, we have $R_{new} = 0$,
which yields $R_{cal} = R_{DPS}/(1 - f)$. The important thing to note in Figure 5.10(a) is that the rate
variation of the actual traffic (represented by the continuous line) has little effect on the accuracy
of computing the aggregate reservation estimate $R_{DPS}$, and consequently of $R_{cal}$. In contrast,
traditional measurement based admission control algorithms, which base their estimation on the
actual traffic, would significantly underestimate the aggregate reservation, especially during the
time periods when no data packets are received. In addition, note that since in this experiment $R_{cal}$
is always larger than $R_{bound}$, and no new reservations are accepted, the value of $R_{bound}$ is never
updated.

In the second experiment we consider a scenario in which a new reservation of 0.5 Mbps is
accepted at time $t = 18$ sec and terminates approximately at time $t = 39$ sec. For the entire time
duration, plotted in Figure 5.10(b), we have background traffic with an aggregate reservation of
0.5 Mbps. Similar to the previous case, we plot the rate of the aggregate traffic, and, in addition,
$R_{bound}$, $R_{cal}$, and $R_{DPS}$. There are several points worth noting. First, when the reservation is
accepted at time $t = 18$ sec, $R_{bound}$ increases by the value of the accepted reservation, i.e., 0.5
Mbps (see Figure 5.6). In this way, $R_{bound}$ is guaranteed to remain an upper bound of the aggregate
reservation $R$. In contrast, since both $R_{DPS}$ and $R_{cal}$ are updated only at the end of the estimation
interval, they underestimate the aggregate reservation, as well as the aggregate traffic, before time
$t = 20$ sec. Second, after $R_{cal}$ is updated at time $t = 20$ sec, as $R_{DPS}/(1 - f) + R_{new}$, the new
value significantly overestimates the aggregate reservation. This is the main reason for which we do
not use $R_{cal}$ ($+ R_{new}$), but $R_{bound}$, to do the admission control test. Third, note that unlike the case
when the reservation was accepted, $R_{bound}$ does not change when the reservation terminates at time
$t = 39$ sec. This is simply because in our implementation no tear-down message is generated when
a reservation terminates. However, as $R_{cal}$ is updated at the end of the next estimation interval (i.e.,
at time $t = 45$ sec), $R_{bound}$ drops to the correct value of 0.5 Mbps. This shows the importance of
using $R_{cal}$ to recalibrate $R_{bound}$. In addition, this illustrates the robustness of our algorithm, i.e.,
the overestimation in a previous period is corrected in the next period. Finally, note that in both
experiments $R_{DPS}$ always underestimates the aggregate reservation. This is due to the truncation
errors in computing both the $b$ values and the $R_{DPS}$ estimate.

Figure 5.10: The estimated aggregate reservation $R_{DPS}$, and the bounds $R_{bound}$ and $R_{cal}$ in the case of (a)
two ON-OFF flows with reservations of 0.5 Mbps, and 1.5 Mbps, respectively, and in the case when (b) one
reservation of 0.5 Mbps is accepted at time $t = 18$ seconds, and then is terminated at $t = 39$ seconds.

Table 5.4: The average and standard deviation of the enqueue and dequeue times, measured in μs.

5.5.1  Processing  Overhead 

To evaluate the overhead of our algorithm we have performed three experiments on a 300 MHz
Pentium II involving 1, 10, and 100 flows, respectively. The reservation and actual sending rates of
all flows are identical. The aggregate sending rate is about 20% larger than the aggregate reservation
rate. Table 5.4 shows the means and the standard deviations for the enqueue and dequeue times at
both ingress and egress nodes. Each of these numbers is based on a measurement of 1000 packets.
For comparison we also show the enqueue and dequeue times for the unmodified code. There are
several points worth noting. First, our implementation adds less than 5 μs overhead per enqueue
operation, and about 2 μs per dequeue operation. In addition, both the enqueue and dequeue times
at the ingress node are greater than at the egress node. This is because the ingress node performs per
flow operations. Furthermore, as the number of flows increases, the enqueue times increase only
slightly, i.e., by less than 20%. This suggests that our algorithm is indeed scalable in the number
of flows. Finally, the dequeue times actually decrease as the number of flows increases. This is
because the rate-controller is implemented as a calendar queue with each entry corresponding to
a 128 μs time interval. Packets with eligible times falling within the same interval are stored in
the same entry. Therefore, when the number of flows is large, more packets are stored in the same
calendar queue entry. Since all these packets are transferred during one operation when they become
eligible, the actual overhead per packet decreases.
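
The batching effect described above can be seen in a minimal Python sketch of a calendar queue with 128 μs entries. The class and method names are hypothetical, and a dictionary stands in for the fixed circular array that an in-kernel implementation would more likely use.

```python
from collections import defaultdict

SLOT_US = 128  # width of one calendar entry in microseconds, as in the text


class CalendarQueue:
    """Rate-controller sketch: packets are bucketed by eligible time."""

    def __init__(self, slot_us=SLOT_US):
        self.slot_us = slot_us
        self.slots = defaultdict(list)  # slot index -> packets eligible in that slot

    def enqueue(self, packet, eligible_us):
        """Store the packet in the entry that covers its eligible time."""
        self.slots[eligible_us // self.slot_us].append(packet)

    def release(self, slot_index):
        """Hand over every packet of one entry in a single operation."""
        return self.slots.pop(slot_index, [])


cq = CalendarQueue()
for flow_id, eligible_us in [(1, 10), (2, 100), (3, 127), (4, 130)]:
    cq.enqueue(f"pkt-of-flow-{flow_id}", eligible_us)

batch = cq.release(0)  # three packets leave with one dequeue operation
print(len(batch))
```

With more flows, more packets share an entry, so the fixed cost of one release is spread over a larger batch, which is consistent with the decreasing per-packet dequeue times reported in the text.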


5.6  Related  Work 

The idea of implementing guaranteed services by using a stateless core architecture was proposed by
Jacobson [76] and Clark [24], and is now being pursued by the IETF Diffserv working group [32].
There are several differences between our scheme and the existing Diffserv proposals. First, our
DPS based algorithms operate at a much finer granularity both in terms of time and traffic
aggregates: the state embedded in a packet can be highly dynamic, as it encodes the current state of the
flow, rather than the static and global properties such as dropping or scheduling priority. In addition,
the goal of our scheme is to implement distributed algorithms that try to approximate the services
provided by a network in which all routers implement per flow management. Therefore, we can
provide service differentiation and performance guarantees in terms of both delay and bandwidth on a
per flow basis. In contrast, the Premium service can provide only per flow bandwidth guarantees.
Finally, we propose fully distributed and dynamic algorithms for implementing both data and
control functionalities, whereas existing Diffserv solutions rely on more centralized and static algorithms
for implementing admission control.

In this chapter, we propose a technique to estimate the aggregate reservation rate and use that
estimate to perform admission control. While this may look similar to measurement-based
admission control algorithms [62, 110], the objectives and thus the techniques are quite different. The
measurement-based admission control algorithms are designed to support controlled-load type of
services. Their estimation is based on the actual amount of traffic transmitted in the past, and is
usually an optimistic estimate in the sense that the estimated aggregate rate is smaller than the aggregate
reserved rate. While this has the benefit of increasing the network utilization by the controlled-load
service traffic, it has the risk of incurring transient overloads that may cause service degradation.
In contrast, our algorithm aims to support guaranteed service, and the goal is to estimate a close
upper bound on the aggregate reserved rate even when the actual arrival rate may vary.

Cruz [29] proposed a novel scheduling algorithm called SCED+ in the context of ATM networks.
In SCED+, virtual circuits sharing the same path segment are aggregated into a virtual path. At each
switch, only per virtual path state instead of per virtual circuit state needs to be maintained for
scheduling purposes. In addition, an algorithm is proposed to compute the eligible times and the
deadlines of a packet at subsequent nodes when the packet enters a virtual path. We note that by
doing this and using DPS to carry this information in the packets’ headers, it is possible to remove
per path scheduling state from core nodes. However, unlike our solution, SCED+ does not provide
per flow delay differentiation within an aggregate. In addition, the SCED+ work focuses on the data
path mechanism, while we address both data path and control path issues.

5.7  Summary 

In  this  chapter,  we  have  described  two  distributed  algorithms  that  implement  QoS  scheduling  and 
admission  control  in  a  SCORE  network.  Combined,  these  two  algorithms  significantly  enhance 
the  scalability  of  both  the  data  and  control  planes,  while  providing  guaranteed  services  with  flex¬ 
ibility,  utilization,  and  assurance  levels  similar  to  those  traditionally  implemented  with  per  flow 
mechanisms.  The  key  technique  used  in  both  algorithms  is  Dynamic  Packet  State  (DPS),  which 
provides  lightweight  and  robust  means  for  routers  to  coordinate  actions  and  implement  distributed 
algorithms.  By  presenting  a  design  and  prototype  implementation  of  the  proposed  algorithms  in 
IPv4  networks,  we  have  demonstrated  that  it  is  indeed  possible  to  apply  DPS  techniques  and  have 
minimum  incompatibility  with  existing  protocols.

Chapter  6


Providing  Relative  Service  Differentiation  in  SCORE 


In this chapter we describe a third application of the DPS technique: implementing a large spatial
granularity  network  service,  called  Location  Independent  Resource  Accounting  (LIRA),  that  pro¬ 
vides  relative  service  differentiation.  Unlike  traditional  services,  such  as  the  Guaranteed  service, 
that are defined on a per flow basis, large spatial granularity services are defined over a large number
of destinations. A simple example would be to guarantee a user 10 Mbps bandwidth irrespective
of  where  or  when  the  user  sends  traffic. 

With  LIRA,  each  user  is  assigned  a  rate  at  which  it  receives  resource  tokens.  For  each  LIRA 
packet,  a  user  is  charged  a  number  of  resource  tokens,  the  amount  depending  on  the  congestion 
level  along  the  packet’s  path.  The  goal  of  LIRA  is  to  achieve  both  high  resource  utilization  and  very 
low  loss  rate.  LIRA  provides  relative  differentiation:  a  user  which  receives  twice  as  many  resource 
tokens  as  another  user  will  receive  about  twice  as  much  bandwidth,  as  long  as  both  users  share  the 
same  links.  Note  that  in  the  case  of  one  link,  LIRA  reduces  to  Weighted  Fair  Queueing,  i.e.,  each 
active  user  is  allocated  a  capacity  that  is  proportional  to  the  rate  at  which  it  receives  resource  tokens. 

We  present  an  integrated  set  of  algorithms  that  implement  the  LIRA  service  model  in  a  SCORE 
network.  Specifically,  we  leverage  the  existing  routing  infrastructure  to  distribute  the  path  costs  to 
all  edge  nodes.  Since  the  path  cost  reflects  the  congestion  level  along  the  path,  we  use  this  cost  to 
design  dynamic  routing  and  load  balancing  algorithms.  To  avoid  packet  re-ordering  within  a  flow, 
we  devise  a  lightweight  mechanism  based  on  DPS  that  binds  a  flow  to  a  route  so  that  all  packets 
from  the  flow  will  traverse  the  same  route.  To  reduce  route  oscillation,  we  probabilistically  bind  a 
flow  to  one  of  the  multiple  routes. 

Traditional  solutions  to  bind  a  flow  to  a  route,  also  known  as  route-pinning,  require  routers  to 
either  maintain  per  flow  state,  or  maintain  state  that  is  proportional  with  the  square  of  the  number 
of  edge  routers.  By  using  DPS,  we  are  able  to  significantly  reduce  this  complexity.  In  particular, 
we  propose  a  route-pinning  mechanism  that  requires  routers  to  maintain  state  which  is  proportional 
only  to  the  number  of  egress  routers. 

The  rest  of  the  chapter  is  organized  as  follows.  The  next  section  motivates  the  LIRA  service 
model,  and  discusses  the  limitation  of  the  existing  alternatives.  Section  6.3  describes  the  LIRA  ser¬ 
vice  and  outlines  its  implementation  in  a  SCORE  network.  Section  6.4  presents  simulation  experi¬ 
ments  to  demonstrate  the  effectiveness  of  our  solution.  Section  6.5  justifies  the  new  service  model 
and  discusses  possible  ways  for  our  scheme  to  implement  other  differential  service  models.  Finally, 
in  Section  6.6  we  present  the  related  work,  and  in  Section  6.7  we  summarize  our  contributions. 




6.1  Background 


Traditional  service  models  that  propose  to  enhance  the  best-effort  service  are  usually  defined  on  a 
per-flow  basis.  Examples  of  such  services  are  the  Guaranteed  and  Controlled  Load  services  [93, 
121] proposed in the context of Intserv [82], and the Premium service [76] proposed in the context
of Diffserv [32]. While these services provide excellent support for a plethora of new point-to-point
applications such as IP telephony and remote diagnostics, there is a growing need to support
services at a coarser granularity than a flow granularity. An example would be a service that is
defined irrespective of where or when a user sends its traffic. Such a service would be much easier
for an organization to negotiate since there is no need to specify the destinations in advance, usually
a daunting task in practice.¹

An  example  of  such  a  service  is  the  Assured  service  which  was  recently  proposed  by  Clark  and 
Wroclawski [23, 24] in the context of Diffserv. With the Assured service, a fixed bandwidth profile
is associated with each user. This profile describes the commitment of the Internet Service Provider
(ISP) to the user. In particular, the user is guaranteed that as long as its aggregate traffic does not
exceed its profile, all of the user's packets are delivered to their destinations with very high probability.
In the remainder of this chapter, we use the term service assurance to denote the probability
with which a packet is delivered to its destination. If a user exceeds its profile, the excess traffic is
forwarded as best-effort traffic. Note that the implicit assumption in the Assured service is that
the traffic sent within a user profile has a much higher assurance (i.e., its packets are delivered with
a much higher probability) than the best effort traffic.

In  this  chapter,  we  propose  a  novel  service,  called  Location  Independent  Resource  Accounting 
(LIRA),  in  which  the  service  profile  is  described  in  terms  of  resource  tokens  rather  than  fixed  band¬ 
width  profile.  In  particular,  with  LIRA,  each  user  is  assigned  a  token  bucket  in  which  it  receives 
resource  tokens  at  a  fixed  rate.  When  a  user  sends  a  packet  into  the  network,  the  user  is  charged  a 
number  of  resource  tokens,  the  amount  depending  on  the  congestion  level  along  the  path  traversed 
by  the  packet.  If  the  user  does  not  have  enough  resource  tokens  in  its  token  bucket,  the  packet  is 
forwarded  as  a  best  effort  packet.  Note  that,  unlike  the  Assured  service  which  provides  an  absolute 
service,  LIRA  provides  a  relative  service.  In  particular,  if  a  user  receives  resource  tokens  at  a  rate 
that is twice the rate of another user, and if both users send traffic along the same paths, the first user
will  get  twice  as  much  aggregate  throughput. 

A  natural  question  is  why  use  a  service  that  offers  only  relative  bandwidth  differentiation  such 
as  LIRA,  instead  of  a  service  that  offers  a  fixed  bandwidth  profile  such  as  the  Assured  service? 
After  all,  the  Assured  service  arguably  provides  a  more  powerful  and  useful  abstraction;  ideally,  a 
user  is  guaranteed  a  fixed  bandwidth  irrespective  of  where  or  when  it  sends  traffic.  In  contrast,  with 
LIRA,  the  amount  of  traffic  a  user  can  send  varies  as  a  result  of  the  congestion  along  the  paths  to 
the  destination. 

The  simple  answer  is  that,  while  the  fixed  bandwidth  profile  is  arguably  more  powerful,  it  is 
unclear  whether  it  can  be  efficiently  implemented.  The  main  problem  follows  directly  from  the 
service  definition,  as  a  fixed  bandwidth  profile  service  does  not  put  any  restriction  on  where  or 
when  a  user  can  send  traffic.  This  results  in  a  fundamental  conflict  between  maximizing  resource 
utilization  and  achieving  a  high  service  assurance.  In  particular,  since  the  network  does  not  know  in 

¹For example, for a Web content provider, it is very hard if not impossible to specify its clients a priori.

advance where the packets will go, in order to provide high service assurance, it needs to provision
enough  resources  to  all  possible  destinations.  In  the  worst  case,  when  the  traffic  of  all  users  traverses 
the  same  congested  link  in  the  network,  an  ISP  has  to  make  sure  that  the  sum  of  all  user  profiles 
does  not  exceed  the  capacity  of  the  bottleneck  link.  Unfortunately,  this  will  result  in  severe  resource 
underutilization,  which  is  unacceptable  in  practice.  Alternatively,  an  ISP  can  provision  resources  for 
the average rather than the worst case scenario. Such a strategy will increase the resource utilization
at  the  expense  of  service  assurance. 

In  contrast,  LIRA  can  achieve  both  high  service  assurance  and  resource  utilization.  However, 
to  achieve  this  goal,  LIRA  gives  up  the  fixed  bandwidth  profile  semantics.  The  bandwidth  profile 
of  a  user  depends  on  the  congestion  in  the  network:  the  more  congested  the  network,  the  lower 
the  profile  of  a  user.  Thus,  while  the  Assured  service  trades  the  service  assurance  for  resource 
utilization,  LIRA  trades  the  fixed  bandwidth  service  semantics  for  resource  utilization.  Next,  we 
argue  that  the  trade-off  made  by  LIRA  gives  a  user  better  control  on  managing  its  traffic,  which 
makes LIRA a compelling alternative to the Assured service.

To  illustrate  this  point,  consider  the  case  when  the  network  becomes  congested.  In  this  case, 
LIRA  tries  to  maintain  the  level  of  service  assurance  by  scaling  back  the  profiles  of  all  users  that 
send traffic in the congested portion of the network. In contrast, in the case of the Assured service,
network  congestion  will  cause  a  decrease  of  the  service  assurance  for  all  users  that  share  the  con¬ 
gested  portion  of  the  network.  Consider  now  a  company  whose  traffic  profile  decreases  from  10 
to  5  Mbps,  as  a  result  of  network  congestion.  Similarly,  assume  that,  in  the  case  of  the  Assured 
service, the same company experiences a tenfold increase in its loss rate as the result of the network
congestion (while its service profile remains constant at 10 Mbps). Finally, assume that the CEO
of the company wants to make an urgent video conference call, which requires 2 Mbps. With
LIRA,  since  the  bandwidth  required  by  the  video  conference  is  no  larger  than  the  company’s  traffic 
profile,  the  CEO  can  initiate  the  conference  immediately.  In  contrast,  with  the  Assured  service,  the 
CEO  may  not  be  able  to  start  the  conference  due  to  the  high  loss  rate.  Worse  yet,  if  the  congestion  is 
caused  by  the  traffic  of  other  users,  the  company  can  do  nothing  about  it.  The  fundamental  problem 
is  that,  unlike  LIRA,  the  Assured  service  does  not  provide  any  protection  in  case  of  congestion. 


6.2  Solution  Outline 

We  consider  the  implementation  of  LIRA  in  a  SCORE  network,  in  which  we  use  a  two  bit  encoding 
scheme.  The  first  bit,  called  the  preferred  bit,  is  set  by  the  application  or  user  and  indicates  the 
dropping  preference  of  the  packet.  The  second  bit,  called  marking  bit,  is  set  by  the  ingress  routers 
of  an  ISP  and  indicates  whether  the  packet  is  in-  or  out-of-profile.  When  a  preferred  packet  arrives 
at  an  ingress  node,  the  node  marks  it,  if  the  user  has  not  exceeded  its  profile;  otherwise  the  packet 
is left unmarked.² The reason to use two bits instead of one is that in an Internet environment with
multiple  ISPs,  even  if  a  packet  may  be  out-of-profile  in  some  ISPs  on  the  earlier  portion  of  its  path, 
it  may  still  be  in-profile  in  a  subsequent  ISP.  Having  a  dropping  bit  that  is  unchanged  by  upstream 
ISPs  on  the  path  will  allow  downstream  ISPs  to  make  the  correct  decision.  Core  routers  implement 
a simple behavior of priority-based dropping. Whenever there is congestion, a core router always

²In this chapter, we will use the terminology of marked or unmarked packets to refer to packets in or out of the service
profile, respectively.

drops unmarked packets first. In this chapter, we focus on mechanisms for implementing LIRA in
a single ISP. We assume the following model for the interaction of multiple ISPs: if ISP A is using
the  service  of  ISP  B,  then  ISP  B  will  treat  ISP  A  just  like  a  regular  user.  In  particular,  the  traffic 
from  all  ISP  A's  users  will  be  treated  as  a  single  traffic  aggregate. 

While the above forwarding scheme can be easily implemented in a Diffserv network, it turns
out  that  to  effectively  support  LIRA  we  need  the  ability  to  perform  route-pinning,  that  is,  to  bind 
a  flow  to  a  route  so  that  all  packets  from  the  flow  will  traverse  the  same  path.  Unfortunately, 
traditional  mechanisms  to  achieve  route  pinning  require  per  flow  state.  Even  the  recently  proposed 
Multi  Protocol  Label  Switching  (MPLS)  requires  routers  to  maintain  an  amount  of  state  proportional 
to  the  square  of  the  number  of  edge  routers.  In  a  large  domain  with  thousands  of  edge  nodes  such 
overhead  may  be  unacceptable. 

To  address  this  problem  we  use  the  Dynamic  Packet  State  (DPS)  technique.  With  each  packet 
we associate a label that encodes the packet's route from the current node to the egress router. Packet
labels  are  initialized  by  ingress  routers,  and  are  used  by  core  routers  to  route  the  packets.  When  a 
packet  is  forwarded,  the  router  updates  its  label  to  reflect  the  fact  that  the  remaining  path  has  been 
reduced  by  one  hop.  By  using  this  scheme,  we  are  able  to  significantly  reduce  the  state  maintained 
by  core  routers.  More  precisely,  this  state  becomes  proportional  to  the  number  of  egress  nodes 
reachable  from  the  core  router,  which  can  be  shown  to  be  optimal.  The  route  pinning  mechanism  is 
described  in  detail  in  Section  6.3.4. 


6.3  LIRA:  Service  Differentiation  based  on  Resource  Right  Tokens 

In  this  section,  we  present  our  differential  service  model,  called  LIRA  (Location  Independent  Re¬ 
source  Accounting),  with  service  profiles  defined  in  terms  of  resource  tokens  rather  than  absolute 
amounts  of  bandwidth. 

With  LIRA,  each  user  i  is  assigned  a  service  profile  that  is  characterized  by  a  resource  token 
bucket $(r_i, b_i)$, where $r_i$ represents the resource token rate, and $b_i$ represents the depth of the bucket.
Unlike traditional token buckets where each preferred bit entering the network consumes exactly
one token, with resource token buckets the number of tokens needed to admit a preferred bit is a
dynamic function of the path it traverses.

Although  there  are  many  functions  that  can  be  used,  we  consider  a  simple  case  in  which  each 
link $i$ is assigned a cost, denoted $c_i(t)$, which represents the amount of resource tokens charged for
sending a marked bit along the link at time $t$. The cost of sending a marked packet is computed as
$\sum_{i \in P} L \times c_i(t)$, where $L$ is the packet length and $P$ is the set of links traversed by the packet. While
we  focus  on  unicast  communications  in  this  chapter,  we  note  that  the  cost  function  is  also  naturally 
applicable  to  the  case  of  multicast.  As  we  will  show  in  Section  6.4,  charging  a  user  for  every  link 
it  uses  and  using  the  cost  in  routing  decisions  helps  to  increase  the  network  throughput.  In  fact,  it 
has been shown by Ma et al. [68] that using a similar cost function³ for performing the shortest path
routing  gives  the  best  overall  results  when  compared  with  other  dynamic  routing  algorithms. 

It  is  important  to  note  that  the  costs  used  here  are  not  monetary  in  nature.  Instead  they  are 
reflecting  the  level  of  congestion  and  the  resource  usage  along  links/paths.  This  is  different  from  a 

³It can be shown that when all links have the same capacity our cost is within a constant factor of the cost of the
shortest-dist(P, 1) algorithm proposed by Ma et al. [68].

Ingress Router Algorithm

upon packet p arrival:
    bit_cost = c1 + c2 + c3 + c4;
    cost(p) = length(p) * bit_cost;
    if (preferred(p) and L > cost(p))
        mark(p);
        L = L - cost(p);

Figure  6.1:  When  a  preferred  packet  arrives,  the  node  computes  the  packet’s  cost,  and  the  packet  is  marked 
if  there  are  sufficient  resource  tokens. 

pricing  scheme  which  represents  the  amount  of  payment  made  by  an  individual  user.  Though  costs 
can  provide  valuable  input  to  pricing  policies,  in  general,  there  is  no  necessary  direct  connection 
between  cost  and  price. 

Figure  6.1  illustrates  the  algorithm  performed  by  ingress  nodes.  When  a  preferred  packet  arrives 
at  an  ingress  node,  the  node  computes  its  cost  based  on  the  packet  length  and  the  path  it  traverses.  If 
the  user  has  enough  resource  tokens  in  its  bucket  to  cover  this  cost,  the  packet  is  marked,  admitted 
in  the  network,  and  the  corresponding  number  of  resource  tokens  is  subtracted  from  the  bucket 
account.  Otherwise,  depending  on  the  policy,  the  packet  can  be  either  dropped,  or  treated  as  best 
effort.  Informally,  our  goal  at  the  user  level  is  to  ensure  that  users  with  “similar”  communication 
patterns  receive  service  (in  terms  of  aggregate  marked  traffic)  in  proportion  to  their  token  rates. 
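The ingress behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ResourceTokenBucket`, the function `admit`, and the time-based refill step are hypothetical and not part of the original design, which specifies only the token rate, the bucket depth, and the marking test of Figure 6.1.

```python
from dataclasses import dataclass


@dataclass
class ResourceTokenBucket:
    """Hypothetical per-user resource token bucket (r, b) with current level L."""
    rate: float               # r: resource tokens received per second
    depth: float              # b: maximum number of tokens the bucket can hold
    level: float              # L: current bucket level
    last_update: float = 0.0  # time of the last refill (assumed bookkeeping)

    def refill(self, now: float) -> None:
        """Add the tokens accrued since the last update, capped at the depth."""
        elapsed = now - self.last_update
        self.level = min(self.depth, self.level + self.rate * elapsed)
        self.last_update = now


def admit(bucket, length_bits, link_costs, preferred, now):
    """Return True if the packet is marked (in-profile), as in Figure 6.1.

    link_costs holds the current per-bit cost of every link on the path.
    """
    bucket.refill(now)
    cost = length_bits * sum(link_costs)   # cost(p) = length(p) * bit_cost
    if preferred and bucket.level > cost:
        bucket.level -= cost               # charge the user
        return True
    return False                           # left unmarked (best effort or dropped)
```

For example, a user holding 5000 tokens who sends a 1000-bit preferred packet over two links with per-bit costs 1 and 2 is charged 3000 tokens; an identical second packet sent immediately afterwards is left unmarked because only 2000 tokens remain.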

The  crux  of  the  problem  then  is  the  computation  and  distribution  of  the  per  marked  bit  cost 
for  each  path.  In  this  section,  we  first  present  the  algorithm  to  compute  the  cost  of  each  marked 
bit  for  a  single  link,  and  next  present  an  algorithm  that  computes  and  distributes  the  per-path  cost 
of  one  marked  bit  by  leveraging  existing  routing  protocols.  We  then  argue  that  this  dynamic  cost 
information  is  also  useful  for  multi-path  routing  and  load  balancing  purposes.  To  avoid  route  oscil¬ 
lation  and  packet  reordering  within  one  application-level  flow,  we  introduce  two  techniques.  First,  a 
lightweight  scheme  is  devised  to  ensure  that  all  packets  from  the  same  application-level  flow  always 
travel  the  same  path.  The  scheme  is  lightweight  in  the  sense  that  no  per  flow  state  is  needed  in  any 
core  routers.  Second,  rather  than  using  a  simple  greedy  algorithm  that  always  selects  the  path  with 
the  current  lowest  cost,  we  use  a  probabilistic  scheme  to  enhance  system  stability. 

(Legend for Figure 6.1: r = resource token rate; L = current bucket level; c_i = per bit cost of link i.)

6.3.1 Link Cost Computation

A natural goal in designing the link cost function in LIRA is to avoid dropping marked packets.
Since in the worst case all users can compete for the same link at the same time, a sufficient condition
to avoid this problem is to have a cost function that exceeds the number of tokens in the system
when the link utilization approaches unity. Without bounding the number of tokens in the system,
this suggests a cost function that goes to infinity when the link utilization approaches unity. Among
many  possible  cost  functions  that  exhibit  this  property,  we  choose  the  following  one: 

$$c(t) = \frac{a}{1 - u(t)} \qquad (6.1)$$

where $a$ is the fixed cost of using the link⁴ when it is idle, and $u(t)$ represents the link utilization at
time $t$. In particular, $u(t) = R(t)/C$, where $R(t)$ is the traffic throughput at time $t$, and $C$ represents
the link capacity. Recall that $c(t)$ is measured in tokens/bit and represents how much a user is
charged for sending a marked bit along that link at time $t$.

In an ideal system, where costs are instantaneously distributed and the rate of the incoming
traffic  varies  slowly,  a  cost  function  as  defined  by  Eq.  (6.1)  guarantees  that  no  marked  packets  are 
dropped  inside  the  core.  However,  in  a  real  system,  computing  and  distributing  the  cost  information 
incur  overhead,  so  they  are  usually  done  periodically.  In  addition,  there  is  always  the  issue  of 
propagation  delay.  Because  of  these,  the  cost  information  used  in  admitting  packets  at  ingress  nodes 
may  be  obsolete.  This  may  cause  packet  dropping,  and  lead  to  oscillations.  Though  oscillations 
are  inherent  to  any  system  in  which  the  propagation  of  the  feed-back  information  is  non-zero,  the 
sensitivity  of  our  cost  function  when  the  link  utilization  approaches  unity  makes  things  worse.  In 
this regime, an incrementally small traffic change may result in an arbitrarily large cost change. In
fact one may note that Eq. (6.1) is similar to the equation describing the delay behavior in queueing
systems  [65],  which  is  known  to  lead  to  system  instability  when  used  as  a  congestion  indication  in 
a  heavily  loaded  system. 

To address these issues, we use the following iterative formula to compute the link cost:

$$c(t_i) = a + c(t_{i-1}) \frac{R(t_{i-1}, t_i)}{C} \qquad (6.2)$$

where $R(t', t'')$ denotes the average bit rate of the marked traffic during the time interval $[t', t'')$. It
is easy to see that if the marked traffic rate is constant and equal to $R$, the above iteration converges
to the cost given by Eq. (6.1). The main advantage of using Eq. (6.2) over Eq. (6.1) is that it is
more robust against large variations in the link utilization. In particular, when the link utilization
approaches unity the cost increases by at most $a$ every iteration. In addition, unlike Eq. (6.1),
Eq. (6.2) is well defined even when the link is congested, i.e., $R(t_{i-1}, t_i) = C$.
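The relationship between the two cost formulas can be checked numerically. The sketch below assumes that Eq. (6.1) has the form c = a / (1 - u) with u = R/C, and that one step of Eq. (6.2) has the form c_new = a + c_prev * R / C; both forms are consistent with the properties stated above (convergence for a constant rate, and growth by exactly a per step when R = C). The function names and the choice of a as the starting cost are hypothetical.

```python
def ideal_cost(a: float, rate: float, capacity: float) -> float:
    """Assumed form of Eq. (6.1): c = a / (1 - u), defined only below full utilization."""
    u = rate / capacity
    if u >= 1.0:
        raise ValueError("Eq. (6.1) is undefined at or above full utilization")
    return a / (1.0 - u)


def iterate_cost(a: float, prev_cost: float, avg_rate: float, capacity: float) -> float:
    """Assumed form of one step of Eq. (6.2): c_new = a + c_prev * R / C."""
    return a + prev_cost * avg_rate / capacity


def run_iteration(a, rate, capacity, steps):
    """Apply Eq. (6.2) repeatedly with a constant marked rate, starting from the idle cost a."""
    cost = a
    history = []
    for _ in range(steps):
        cost = iterate_cost(a, cost, rate, capacity)
        history.append(cost)
    return history
```

With a = 1 and a constant utilization of 0.5, the iteration approaches the ideal cost of 2 tokens/bit. With the link exactly saturated (R = C), the ideal cost is undefined, while the iterative cost simply grows by a at every step.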

Unfortunately,  computing  the  cost  by  using  Eq.  (6.2)  is  not  as  accurate  as  by  using  Eq.  (6.1). 
The  link  may  become  and  remain  congested  for  a  long  time  before  the  cost  increase  is  large  enough 
to  reduce  the  arrival  rate  of  marked  bits.  This  may  result  in  the  loss  of  marked  packets,  which  we 
try to avoid. To alleviate this problem we use only a fraction of the link capacity, $\tilde{C} = \beta C$, for the
marked traffic, the remainder being used to absorb unexpected variations due to inaccuracies in
the cost estimation.⁵ Here, we chose $\beta$ between 0.85 and 0.9.


6.3.2  Path  Cost  Computation  and  Distribution 

In  LIRA,  the  cost  of  a  marked  bit  over  a  path  is  the  sum  of  the  costs  of  a  marked  bit  over  each  link 
on the path. Once the cost for each link is computed, it is easy to compute and distribute the path

⁴In practice, the network administrator can make use of $a$ to encourage/discourage the use of the link. Simply by
changing the fixed cost $a$, a link will cost proportionally more or less at the same utilization.

⁵$\beta$ is similar to the pressure factor used in some ABR congestion control schemes for estimating the fair share [61, 87].

Figure 6.2: Example of route binding via packet labeling.

cost  by  leveraging  existing  routing  protocols.  For  link  state  algorithms,  the  cost  of  each  marked  bit 
can  be  included  as  part  of  the  link  state.  For  distance  vector  algorithms,  we  can  pass  and  compute 
the  partial  path  cost  in  the  same  way  the  distance  of  a  partial  path  is  computed  with  respect  to  the 
routing  metric. 
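As a rough illustration of the distance vector case, the following sketch accumulates the per-bit path cost alongside the hop count while routes are selected by hop count alone. It is a centralized simulation rather than a distributed protocol, and the function name, the table layout, and the use of hop count as the routing metric are assumptions made for the example.

```python
def distance_vector(links, dest):
    """Compute, for every node, (hops, path_cost, next_hop) toward dest.

    links maps a directed link (u, v) to its current per-bit cost.
    Routes are chosen by hop count; the LIRA path cost is carried along
    and summed in the same way as the distance.
    """
    nodes = {n for edge in links for n in edge}
    table = {dest: (0, 0.0, None)}
    for _ in range(len(nodes) - 1):          # Bellman-Ford style passes
        changed = False
        for (u, v), link_cost in links.items():
            if v not in table:
                continue                     # v has nothing to advertise yet
            hops, path_cost, _ = table[v]
            candidate = (hops + 1, path_cost + link_cost, v)
            if u not in table or table[u][0] > candidate[0]:
                table[u] = candidate
                changed = True
        if not changed:
            break
    return table
```

In a five-node example where A reaches E either through B (two hops, link costs 1 and 2) or through C and D (three hops, link costs 0.5 each), A's entry records the two-hop route with a path cost of 3 tokens/bit.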

6.3.3  Multipath  Routing  and  Load  Balancing 

Since  our  algorithm  defines  a  dynamic  cost  function  that  reflects  the  congestion  level  of  each  link, 
it  is  natural  to  use  this  cost  function  for  the  purpose  of  multi-path  routing.  To  achieve  this,  we 
compute  the  k  shortest  paths  for  each  destination  or  egress  node  using  the  unit  link  metric.  While 
the  obvious  solution  is  to  send  packets  along  the  path  with  the  minimum  cost  (in  the  sense  of  LIRA, 
see  Section  6.3)  among  the  k  paths,  this  may  introduce  two  problems:  (a)  packet  re-ordering  within 
one  application-level  flow,  which  may  negatively  affect  end-to-end  congestion  control  algorithms, 
and  (b)  route  oscillation,  which  may  lead  to  system  instability. 

We  introduce  two  techniques  to  address  these  problems.  First,  we  present  a  lightweight  mech¬ 
anism  that  binds  a  flow  to  a  route  so  that  all  packets  from  the  flow  will  traverse  the  same  route. 
Second,  to  reduce  route  oscillation,  for  each  new  flow,  an  ingress  node  probabilistically  binds  it  to 
one  of  the  multiple  routes.  By  carefully  selecting  the  probability,  we  can  achieve  both  stability  and 
load-balancing. 

6.3.4  Route  Pinning 

As  discussed  earlier,  we  will  maintain  multiple  routes  for  each  destination.  However,  we  would 
like  to  ensure  that  all  packets  belonging  to  the  same  flow  are  forwarded  along  the  same  path.  To
implement  this  mechanism  we  use  the  Dynamic  Packet  State  (DPS)  technique.

The basic idea is to associate with each path a label computed as the XOR over the identifiers of
all routers along the path, and then associate this label with each packet of a flow that goes along that
path. Here we use the IP address as the identifier. More precisely, a path $P = (id_0, id_1, \ldots, id_n)$,
where $id_0$ is the source and $id_n$ is the destination, is encoded at the source ($id_0$) by
$l_0 = id_1 \oplus id_2 \oplus \ldots \oplus id_n$. Similarly, the path from $id_1$ to $id_n$ is encoded at $id_1$ by $l_1 = id_2 \oplus \ldots \oplus id_n$. A
packet that travels along path $P$ is labeled with $l_0$ as it is leaving $id_0$, and with $l_1$ as it is leaving $id_1$.
By using XOR we can iteratively re-compute the label based on the packet's current label and the
node identifier. As an example, consider a packet that is assigned label $l_0$ at node $id_0$. When the
packet arrives at node $id_1$, the new label corresponding to the remainder of the path, $(id_1, \ldots, id_n)$,
is computed as follows:

$$l_1 = id_1 \oplus l_0 = id_1 \oplus (id_1 \oplus id_2 \oplus \ldots \oplus id_n) = id_2 \oplus \ldots \oplus id_n. \qquad (6.3)$$

It  is  easy  to  see  that  this  scheme  guarantees  that  the  packet  will  be  forwarded  exactly  along  the  path 
P.  Here,  we  implicitly  assume  that  all  alternate  paths  between  two  end-nodes  have  unique  labels. 
Although  theoretically  there  is  a  non-zero  probability  that  two  labels  may  collide,  we  believe  that 
for  practical  purposes  it  can  be  neglected.  One  possible  way  to  reduce  the  label  collision  probability 
would  be  to  use  a  hash  function  to  translate  the  IP  addresses  into  labels.  By  using  a  good  hash 
function,  this  will  result  in  a  more  random  distribution  of  router  labels.  Another  possibility  would 
be  to  explicitly  label  routers  to  reduce  or  even  eliminate  the  collision  probability.  Note  that  this 
solution requires maintaining the mapping between router IP addresses and router labels, which
can be difficult in practice. One last point worth noting is that even if two alternate paths have the
same label, this will not jeopardize the correctness of our scheme: the worst thing that can happen
is for an alternate path to be ignored, which will only lead to a decrease in utilization.
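The XOR encoding is simple enough to demonstrate directly. The sketch below uses IPv4 addresses as router identifiers, as the text suggests; the helper names `router_id`, `path_label`, and `update_label` and the example addresses are hypothetical.

```python
import ipaddress
from functools import reduce
from operator import xor


def router_id(addr: str) -> int:
    """Use the 32-bit IPv4 address as the router identifier."""
    return int(ipaddress.IPv4Address(addr))


def path_label(path_ids):
    """Label carried by a packet leaving path_ids[0]: XOR of all downstream identifiers."""
    return reduce(xor, path_ids[1:], 0)


def update_label(label: int, own_id: int) -> int:
    """Performed by a router when a packet arrives: remove its own identifier from the label."""
    return label ^ own_id
```

For a path of four routers, applying `update_label` at the second router yields exactly the label of the remaining path, and after the egress router applies it the label is zero. Because XOR is commutative, two paths that visit the same set of routers in a different order receive the same label, which is one concrete source of the collisions discussed above.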

Next  we  give  some  details  of  how  this  mechanism  can  be  implemented  by  simply  extending  the 
information  maintained  by  each  router  in  the  routing  and  forwarding  tables.  Besides  the  destination 
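and the route cost, each entry in the routing table also contains the label associated with that path.

$$\langle \mathit{dst}, \langle \mathit{cost}^{(1)}, l^{(1)} \rangle, \ldots, \langle \mathit{cost}^{(k)}, l^{(k)} \rangle \rangle \qquad (6.4)$$

Similarly, the forwarding table should contain an entry for each path:

$$\langle l^{(1)}, \mathit{dst}, \mathit{next\_hop}^{(1)} \rangle, \ldots, \langle l^{(k)}, \mathit{dst}, \mathit{next\_hop}^{(k)} \rangle \qquad (6.5)$$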

In Figure 6.2 we give a simple example to illustrate this mechanism. Assume that nodes $id_1$ and
$id_5$ are edge nodes, and there are two possible paths from $id_1$ to $id_5$ of costs 7 and 8, respectively.
Now, assume a packet destined to $id_5$ arrives at $id_1$. First the ingress node $id_1$ searches the classifier
table (not shown in the figure) that maintains a list of all flows to see whether this is the first packet
of a flow. If it is, the router uses the information in the routing table to probabilistically bind the
flow to a path to $id_5$. At the same time it labels the packet with the encoding of the selected route.
In our example, assume the path of cost 7, which leaves $id_1$ through $id_2$, is selected. If the arriving packet
is not the first packet of the flow, the router automatically labels the packet with the encoding of the
path to which the flow is bound. This can be simply achieved by keeping a copy of the label in the
classifier table. Once the packet is labeled, the router checks the forwarding table for the next hop
by matching the packet's label and its destination. In our case, this operation gives us $id_2$ as the next
hop. When the packet arrives at node $id_2$ the router first computes a new label based on the current
packet label and the router identifier: $label = id_2 \oplus label$. The new label is then used to look up the
forwarding table.

It  is  important  to  note  that  the  above  algorithm  assumes  per  flow  state  only  at  ingress  nodes. 
Inside the core, there is no per flow state. Moreover, the labels can speed up the table lookup if used
as hash keys.
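
The label-based forwarding just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: it assumes that the label operator ⊗ is bitwise XOR, that the label held at a node is the XOR of the identifiers of all downstream nodes on the path, and it uses made-up router identifiers and two hypothetical paths. All function and variable names are invented for the example.

```python
from functools import reduce
from operator import xor

# Hypothetical router identifiers; any distinct integers would do.
ID = {1: 0x11, 2: 0x22, 3: 0x34, 4: 0x48, 5: 0x5F}

def path_label(path):
    """Label held at path[0]: XOR of the identifiers of all downstream nodes (assumed encoding)."""
    return reduce(xor, (ID[n] for n in path[1:]), 0)

def relabel(router, label):
    """Core-router step from the text: label = id XOR label."""
    return ID[router] ^ label

def build_forwarding_tables(paths):
    """For every node on every path, map (label, dst) -> next hop."""
    tables = {}
    for path in paths:
        dst = path[-1]
        for i in range(len(path) - 1):
            key = (path_label(path[i:]), dst)
            tables.setdefault(path[i], {})[key] = path[i + 1]
    return tables

def forward(tables, chosen_path):
    """Label a packet once at the ingress, then forward it hop by hop."""
    node, dst = chosen_path[0], chosen_path[-1]
    label = path_label(chosen_path)      # per flow decision, made only at the ingress
    visited = [node]
    while node != dst:
        node = tables[node][(label, dst)]  # lookup keyed by (label, destination)
        visited.append(node)
        label = relabel(node, label)       # strip this router's identifier
    return visited

# Two illustrative alternate paths between the same pair of edge nodes.
PATHS = [[1, 2, 3, 5], [1, 4, 5]]
TABLES = build_forwarding_tables(PATHS)
```

Note that only `forward`'s first step needs to know which path the flow is bound to; every later hop works from the packet label and destination alone, which is why no per flow state is needed in the core.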

6.3.5  Path  Selection 

While  the  above  forwarding  algorithm  ensures  that  all  packets  belonging  to  the  same  flow  traverse 
the  same  path,  there  is  still  the  issue  of  how  to  select  a  path  for  a  new  flow.  The  biggest  concern 
with  any  dynamic  routing  protocol  based  on  congestion  information  is  its  stability.  Frequent  route 
changes  may  lead  to  oscillations. 

To  address  this  problem,  we  associate  a  probability  with  each  route  and  use  it  in  binding  a  new 
flow  to  that  route.  The  goal  in  computing  this  probability  is  to  equalize  the  costs  along  the  alternate 
routes,  if  possible.  For  this  we  use  a  greedy  algorithm.  Every  time  the  route  costs  are  updated 
we  split  the  set  of  routes  in  two  equal  sets,  where  all  the  routes  in  one  set  have  costs  larger  than 
the  routes  in  the  second  set.  If  there  is  an  odd  number  of  routes,  we  leave  the  median  out.  Then, 
we decrease the probability of every route in the first set, the one which contains the higher cost
routes, and increase the probability of each route in the second set by a small constant δ. It can be
shown that in a steady-state system, this algorithm converges to the desired solution, in which the
difference between the costs of the two alternate paths is bounded by δ.
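
A minimal Python sketch of this greedy update is given below. It is an illustration only: the function names, the default step size, and the clamping rule (a route's probability is never pushed below zero, and the same amount is moved from an expensive route to a cheap one so that the probabilities still sum to one) are assumptions that the text does not spell out.

```python
import random

def update_probabilities(costs, probs, delta=0.05):
    """One greedy step: shift probability from the costlier half of the routes to the cheaper half."""
    k = len(costs)
    order = sorted(range(k), key=lambda i: costs[i])  # cheapest route first
    half = k // 2                                     # the median is left out when k is odd
    cheap, expensive = order[:half], order[k - half:]
    new_probs = list(probs)
    for lo, hi in zip(cheap, reversed(expensive)):
        shift = min(delta, new_probs[hi])             # assumed clamp: never go below zero
        new_probs[hi] -= shift
        new_probs[lo] += shift
    return new_probs

def pick_route(probs, rng):
    """Bind a new flow to a route index drawn according to the current probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]
```

For example, with two routes of costs 7 and 8 and equal probabilities, one update with `delta=0.1` moves the probabilities to roughly 0.6 and 0.4, so new flows start to favor the cheaper route.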

6.3.6  Scalability 

As described so far, our scheme must maintain k entries for each destination in both the
forwarding table used by the forwarding engine and the routing table used by the routing protocol,
where k is the maximum number of alternate paths. While this factor may not be significant if k
is small, a more serious issue that potentially limits the scalability of the algorithm is that, in its
basic form, it requires an entry for each destination, whereas in reality, to achieve
scalability, routers maintain only the longest prefix of a group of destinations that share the same
route [41]. Since our algorithm works in the context of one ISP, we can maintain an entry for each
egress node instead of each destination. We believe this is sufficient, as the number of egress nodes
in an ISP is usually not large.

Figure 6.3: Topology to illustrate the label and cost aggregation.

However, assume that the number of egress nodes in an ISP is very large so that significant
address aggregation is needed. Then we need to also perform cost aggregation. To illustrate the
problem consider the example in Figure 6.3. Assume the addresses of d_0 and d_1 are aggregated
at an intermediate router r_1. Now the question is how much to charge a packet that enters at the
ingress node r_0 and has the destination d_0. Since we do not keep state for the individual routes
to d_0 and d_1, respectively, we need to aggregate the cost to these two destinations. In doing this,
a natural goal would be to maintain the total charges the same as in a reference system that keeps
per route state. Let R(r_1, d_i) denote the average traffic rate from r_1 to d_i, i = 0, 1. Then, in the
reference system that maintains per route state, the total charge per time unit for the aggregate traffic
from r_1 to d_0 and d_1 is cost(r_1, d_0) R(r_1, d_0) + cost(r_1, d_1) R(r_1, d_1). In a system that does not
maintain per route state, the charge for the same traffic is cost(r_1, d_0, d_1) (R(r_1, d_0) + R(r_1, d_1)),
where cost(r_1, d_0, d_1) denotes the per bit aggregate cost. This yields

$$cost(r_1, d_0, d_1) = \frac{cost(r_1, d_0) R(r_1, d_0)}{R(r_1, d_0) + R(r_1, d_1)} + \frac{cost(r_1, d_1) R(r_1, d_1)}{R(r_1, d_0) + R(r_1, d_1)}$$ (6.6)

Thus, any packet that arrives at r_0 and has either destination d_0 or d_1 is charged with cost(r_0, r_1) +
cost(r_1, d_0, d_1). Obviously, route aggregation increases the inaccuracies in cost estimation. However,
this may be alleviated by the fact that route aggregation usually exhibits high locality.
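
The aggregate cost above is simply a traffic-weighted average of the per route costs. The following Python sketch (function name and sample numbers are made up for illustration) computes it and makes it easy to check that the total charge per time unit matches that of the reference system.

```python
def aggregate_cost(costs, rates):
    """Per bit aggregate cost: rate-weighted average of the per destination costs."""
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("aggregate cost is undefined when there is no traffic")
    return sum(c * r for c, r in zip(costs, rates)) / total_rate

# Hypothetical values: cost(r1, d0) = 2, cost(r1, d1) = 4, R(r1, d0) = 3, R(r1, d1) = 1.
costs, rates = [2.0, 4.0], [3.0, 1.0]
agg = aggregate_cost(costs, rates)
reference_charge = sum(c * r for c, r in zip(costs, rates))  # per route state
aggregate_charge = agg * sum(rates)                          # aggregated state
```

With these numbers the aggregate cost is 2.5 per bit, and both systems charge 10 per time unit in total, although individual packets to d_0 are overcharged and packets to d_1 are undercharged.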

Another  problem  with  address  aggregation  is  that  a  label  can  no  longer  be  used  to  encode  the 
entire  path  to  the  destination.  Instead,  it  is  used  to  encode  the  common  portion  of  the  paths  to  the 
destinations  in  the  aggregate  set.  This  means  that  a  packet  should  be  relabeled  at  every  router  that 
performs  aggregation  involving  the  packet's  destination.  The  most  serious  concern  with  this  scheme 
is that it is necessary to maintain per flow state and perform packet classification at a core router (r_1
in our example). Fortunately, this scalability problem is alleviated by the fact that we need to keep
per  flow  state  only  for  the  flows  whose  destination  addresses  are  aggregated  at  the  current  router. 
Finally,  we  note  that  this  problem  is  not  specific  to  our  scheme;  any  scheme  that  (i)  allows  multiple 
path  routing,  (ii)  performs  load  balancing,  and  (iii)  avoids  packet  reordering  has  to  address  it. 


6.4  Simulation  Results 

In  this  section  we  evaluate  our  model  by  simulation.  We  conduct  four  experiments:  three  involving 
simple  topologies  which  help  to  gain  a  better  understanding  of  the  behavior  of  our  algorithms,  and 
one  more  realistic  example  with  a  larger  topology  and  more  complex  traffic  patterns.  The  first 
experiment  shows  that  if  all  users  share  the  same  congested  path,  then  each  user  receives  service 
in  proportion  to  its  resource  token  rate.  This  is  the  same  result  one  would  expect  from  using  a 
weighted  fair  queueing  scheduler  on  every  link,  with  the  weights  set  to  the  users'  token  rate.  In 
the  second  experiment,  we  show  that  by  using  dynamic  routing  and  load  balancing,  we  are  able  to 
achieve  the  same  result  -  that  is,  each  user  receives  service  in  proportion  to  its  token  rate  -  in  a 
more general configuration where simply using a weighted fair queueing scheduler on every link is
not sufficient. In the third experiment, we show how load balancing can significantly increase the
overall resource utilization. Finally, the fourth experiment shows how the behaviors observed in the
previous  experiments  scale  to  a  larger  topology. 

6.4.1 Experiment Design

We have implemented a packet level simulator which supports both Distance Vector (DV) and Shortest
Path First (SPF) routing algorithms. To support load balancing we extended these algorithms to
compute the k shortest paths. The time interval between two route updates is uniformly distributed
between 0.5 and 1.5 of the average value. As shown by Floyd and Jacobson [38], this
choice avoids the route-update self-synchronization. In SPF, when a node receives a routing message,
it first updates its routing table and then forwards the message to all its neighbors, except the
sender. The routing messages are assumed to have high priority, so they are never lost. In the next
sections we compare the following schemes:

• BASE - this scheme models today’s best-effort Internet, and it is used as a baseline in our
comparison. The routing protocol uses the number of hops as the distance metric and it is
implemented by either DV or SPF. This scheme does not implement service differentiation,
i.e., both marked and unmarked packets are identically treated.

• STATIC - this scheme implements the same static routing as BASE. In addition, it implements
LIRA by computing the link cost as described in Section 6.3.1, and marking packets at each
ingress node according to the algorithm shown in Figure 6.1.

• DYNAMIC-k - this scheme adds dynamic routing and load balancing to STATIC. The routing
protocol uses a modified version of DV/SPF to find the first k shortest paths. Note that
DYNAMIC-1 is equivalent to STATIC.
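
The randomized route-update timer mentioned in the simulator description above (an interval drawn uniformly between 0.5 and 1.5 times the average) can be sketched as follows. This is an illustration rather than the simulator's code; the function name and the use of Python's `random` module are assumptions.

```python
import random

def next_update_interval(average, rng):
    """Draw the time until the next route update, uniformly in [0.5, 1.5] x average."""
    return rng.uniform(0.5 * average, 1.5 * average)

# Example: ten successive update intervals for a 5-second average.
rng = random.Random(42)
intervals = [next_update_interval(5.0, rng) for _ in range(10)]
```

Because each router draws its intervals independently, updates from different routers drift apart instead of locking into a common period.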

Each router implements a FIFO scheduling discipline with a shared buffer and a drop-tail management
scheme. When the buffer occupancy exceeds a predefined threshold, newly arrived unmarked
packets are dropped. Thus, the entire buffer space from the threshold up to its total size is
reserved to the in-profile traffic.^ Unless otherwise specified, throughout all our experiments we use
a buffer size of 256 KB and a threshold of 64 KB.
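
The buffer management rule above can be sketched as a small Python class. This is a simplified illustration, not the simulator's code: the class and method names are invented, and KB is assumed to mean 1024 bytes, which the text does not specify.

```python
from collections import deque

KB = 1024  # assumption: the text does not say whether KB means 1000 or 1024 bytes

class ThresholdDropTailQueue:
    """FIFO buffer that drops unmarked packets once occupancy exceeds a threshold."""

    def __init__(self, size=256 * KB, threshold=64 * KB):
        self.size = size
        self.threshold = threshold
        self.occupancy = 0          # bytes currently queued
        self.packets = deque()

    def enqueue(self, nbytes, marked):
        """Return True if the packet is accepted, False if it is dropped."""
        if self.occupancy + nbytes > self.size:
            return False            # drop-tail: buffer is full
        if not marked and self.occupancy > self.threshold:
            return False            # space above the threshold is reserved for marked packets
        self.packets.append((nbytes, marked))
        self.occupancy += nbytes
        return True

    def dequeue(self):
        """Remove and return the oldest packet as (nbytes, marked), or None if empty."""
        if not self.packets:
            return None
        nbytes, marked = self.packets.popleft()
        self.occupancy -= nbytes
        return nbytes, marked
```

Marked (in-profile) packets are only dropped when the buffer is completely full, while unmarked packets compete only for the space below the threshold.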

The  two  main  performance  indices  that  we  use  in  comparing  the  above  schemes  are  the  user 
in-profile  and  user  overall  throughputs.  The  user  in-profile  throughput  represents  the  rate  of  the  user 
aggregate  in-profile  traffic  delivered  to  its  destinations.  The  overall  throughput  represents  the  user’s 
entire traffic, i.e., both the in- and out-of-profile traffic, delivered to its destinations.
In addition, we use the user dropping rate of the in-profile traffic to characterize the level of service
assurance. 

^We note that this scheme is a simplified version of the RIO buffer management scheme proposed by Clark and
Wroclawski [24]. In addition, RIO implements a Random Early Detection (RED) [37] dropping policy, instead of
drop-tail, for both in- and out-of-profile traffic. RED provides an efficient detection mechanism for the adaptive flows, such as
TCP, allowing them to gracefully degrade their performance when congestion occurs. However, since in this study we
are not concerned with the behavior of individual flows, for simplicity we chose not to implement RED.

Figure 6.4: (a) Topology used in the first experiment. Each link has 10 Mbps capacity. S1, S2, and S3 send
all their traffic to D1. (b) The throughputs of the three users under BASE and STATIC schemes. (c) The
throughputs under STATIC when the token rate of S2 is twice the rate of S1/S3.

Recent studies have shown that the traffic in real networks exhibits the self-similar property [27,
80, 81, 118], that is, the traffic is bursty over widely different time scales. To generate self-similar
traffic we use the technique originally proposed by Willinger et al. [118], where it was
shown that the superposition of many ON-OFF flows with ON and OFF periods drawn from a heavy
tail distribution, and which have fixed rates during the ON period, results in self-similar traffic. In
particular, Willinger et al. [118] show that the aggregation of several hundred ON-OFF flows is a
reasonable approximation of the real end-to-end traffic observed in a LAN.

In all our experiments, we generate the traffic by drawing the length of the ON and OFF periods
from a Pareto distribution with the power factor of 1.2. During the ON period a source sends packets
with sizes between 100 and 1000 bytes. The time to send a packet of minimum size during the ON
period is assumed to be the time unit in computing the length of the ON and OFF intervals.
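
A minimal Python sketch of such an ON-OFF source is shown below. It is not the simulator's code: the function names are invented, the ON-period sending rate is a parameter the text does not specify, and drawing packet sizes uniformly is an assumption (the text only gives the range).

```python
import random

PARETO_SHAPE = 1.2                       # the "power factor" from the text
MIN_PKT_BYTES, MAX_PKT_BYTES = 100, 1000

def time_unit_seconds(on_rate_bps):
    """Time to send one minimum-size packet at the (assumed) ON-period rate."""
    return MIN_PKT_BYTES * 8 / on_rate_bps

def on_off_schedule(n_cycles, on_rate_bps, rng):
    """Return n_cycles (on_seconds, off_seconds) pairs with Pareto-distributed lengths."""
    unit = time_unit_seconds(on_rate_bps)
    return [(rng.paretovariate(PARETO_SHAPE) * unit,
             rng.paretovariate(PARETO_SHAPE) * unit)
            for _ in range(n_cycles)]

def packet_size(rng):
    """Packet size in bytes; the uniform choice is an assumption of this sketch."""
    return rng.randint(MIN_PKT_BYTES, MAX_PKT_BYTES)
```

With a shape parameter of 1.2 the period lengths have a finite mean but infinite variance, which is the heavy-tail property that makes the superposition of many such sources bursty over many time scales.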

Due to the high overhead incurred by a packet-level simulator such as ours, we limit the link
capacities to 10 Mbps and the simulation time to 200 sec. We set the average interval between
routing updates to 5 sec for the small topologies used in the first three experiments, and to 3 sec for
the large topology used in the last experiment. In all experiments, the traffic starts at time t = 20
sec. The choice of this time guarantees that the routing algorithm finds at least one path between
any two nodes by time t. In order to eliminate the transient behavior, we start our measurements at
time t = 50 sec.

6.4.2  Experiment  1:  Local  Fairness  and  Service  Differentiation 

This experiment shows that if all users send their traffic along the same congested path, they get
service in proportion to their token rate, as long as there is enough demand. Consider the topology
in Figure 6.4(a), where users S1, S2, and S3 send traffic to D1. Figure 6.4(b) shows the user overall
throughputs over the entire simulation under BASE. As can be seen, S1 gets significantly more
than the other two. In fact, if the traffic from all sources were continuously backlogged, we would
expect S1 to get half of the congested links 5 and 6, while S2 and S3 split the other half. The measured
shares do not match this split exactly because, even though each user sends at an average rate higher
than 10 Mbps, the queues are not continuously backlogged. This is due to the bursty nature of the
traffic and to the limited buffer space at each router.

Figure 6.5: (a) Topology used in the second experiment. S1, S2, S3, and S4 send all their traffic to D1,
D2, and D3, respectively. (b) The throughputs of all users under BASE, STATIC, and DYNAMIC-2.

Next, we run the same simulation for the STATIC scheme. To each user we assign the same
token rate, and to each link we associate the same fixed cost. Figure 6.4(b) shows the user overall
and in-profile throughputs. Compared to BASE, the overall throughputs are more evenly distributed.
However, user S1 still gets slightly better service, i.e., its in-profile throughput is 3.12 Mbps,
while the in-profile throughput of S2/S3 is 2.75 Mbps. To see why, recall from Eq. (6.1) that link
cost accurately reflects the level of congestion on that link. Consequently, in this case links 5 and 6
will have the highest cost, followed by link 4, and then the other three links. Thus, S2 and S3 have
to “pay” more than S1 per marked bit. Since all users have the same token rates, this translates into
lower overall throughputs for S2 and S3, respectively.

To illustrate the relationship between the user’s token rate and its performance, we double the
token rate of S2. Figure 6.4(c) shows the overall and in-profile throughputs of each user. In terms of
in-profile traffic, user S2 gets roughly twice the throughput of S3 (i.e., 4.27 Mbps vs. 2.18 Mbps).

Finally,  we  note  that  there  were  no  marked  packets  dropped  in  any  of  the  above  simulations.  For 
comparison, more than 60% of the out-of-profile traffic was dropped.

6.4.3  Experiment  2:  User  Fairness  and  Load  Balancing 

In  this  section  we  show  how  dynamic  routing  and  load  balancing  help  to  improve  user  level  fairness 
and achieve better resource utilization. Consider the topology in Figure 6.5 where users S1, S2, S3
and S4 send traffic to each of the users D1, D2 and D3, respectively. Again the fixed costs of all
links  are  equal,  and  all  users  are  assigned  the  same  token  rate. 

Figure 6.5(b) shows the overall and in-profile throughputs of S1, S2, S3 and S4 under BASE,
STATIC and DYNAMIC-2, respectively. When BASE and STATIC are used, each user always sends
along the shortest paths. This results in S1, S2 and S3 sharing link 1, while S4 alone uses link 3. As
a consequence S4 receives significantly better service than the other three users. Since it implements
the same routing algorithm, STATIC does not improve the overall throughputs. However, compared
with BASE, STATIC guarantees that in-profile packets are delivered with very high probability
(again, in this experiment, no marked packets were dropped). On the other hand, when DYNAMIC-2
is used, each user receives almost the same service. This is because users S1, S2 and S3 can
now use both routes to send their traffic, which allows them to compete with user S4 for link 3.
User S4 still maintains a slight advantage, but now the difference between its overall throughput
and the overall throughputs of the other users is less than 7%. In the case of the in-profile traffic
this difference is about 5%. As in the previous experiment, the reason for this difference is that
when competing with S4, the other users have to pay, besides link 3, for link 2 as well.

Figure 6.6: (a) Topology used in the third experiment. Mean throughputs when (b) load is balanced, and (c)
when it is unbalanced, i.e., S3 and S4 are inactive.

Thus, by taking advantage of the alternate routes, our scheme is able to achieve fairness in a
more general setting. At the same time it is worth noting that the overall throughput also increases
by almost 7%. However, in this case, this is mainly due to the bursty nature of S4's traffic which
cannot use the entire capacity of link 3 when it is the only one using it, rather than load balancing.

6.4.4  Experiment  3:  Load  Distribution  and  Load  Balancing 

This experiment shows how the load distribution affects the effectiveness of our load balancing
scheme. For this purpose, consider the topology in Figure 6.6(a). In the first simulation we generate
flows that have the source and the destination uniformly distributed among users. Figure 6.6(b)
shows the means of the overall throughputs under BASE, STATIC, and DYNAMIC-2, respectively.^
Due to the uniformity of the traffic pattern, in this case BASE performs very well. Under STATIC
we get slightly larger overall throughput, mainly due to our congestion control scheme, which admits
a marked packet only if there is a high probability that it will be delivered. However, under
DYNAMIC-2 the performance degrades. This is because there are times when our probabilistic
routing algorithm selects longer routes, which leads to inefficient resource utilization.

^We have also computed standard deviations for each case: the largest standard deviation was 0.342 for the overall
throughput under STATIC scheme, and 0.4 for the in-profile throughput under DYNAMIC-2.

Figure 6.7: Topology similar to the T3 topology of the NSFNET backbone network containing the IBM
NSS nodes.


Next, we consider an unbalanced load by making users S3 and S4 inactive. Figure 6.6(c) shows
throughput means under BASE, STATIC, and DYNAMIC-2, respectively. As can be seen,
using DYNAMIC-2 increases the mean by 30%. This is because under BASE and STATIC schemes
the entire traffic between S1, S2 and S5, S6 is routed through links 3 and 4 only. On the other hand,
DYNAMIC-2 takes advantage of the alternate route through links 1 and 2.

Finally, in another simulation not shown here we considered the scenario in which S5, S6, S7,
and S8 send their entire traffic to S3 and S4, respectively. In this case DYNAMIC-2 outperforms
STATIC and BASE by almost a factor of two in terms of in-profile and overall throughputs. This is again
because BASE and STATIC exclusively use links 3 and 2, while DYNAMIC-2 is able to use all four
links.


6.4.5  Experiment  4:  Large  Scale  Example 

In this section we consider a larger topology that closely resembles the T3 topology of the NSFNET
backbone containing the IBM NSS nodes (see Figure 6.7). The major difference is that in order to
limit the simulation time we assume 10 Mbps links, instead of 45 Mbps. We consider the following
three scenarios.

Figure 6.8: The throughputs when the load is (a) balanced (Figure 6.7(a)), (b) unbalanced (Figure 6.7(b)), and
(c) when the network is virtually partitioned (Figure 6.7(c)).


In the first scenario we assume that load is uniformly distributed, i.e., any two users communicate
with the same probability. Figure 6.8(a) shows the results for each scheme, which are consistent
with the ones obtained in the previous experiment. Due to the congestion control which reduces the
number of dropped packets in the network, STATIC achieves higher throughput than BASE. On the
other hand, dynamic routing and load balancing are not effective in this case, since they tend to generate
longer routes, which lead to inefficient resource utilization. This is illustrated by the decrease
of the overall and the in-profile throughputs under DYNAMIC-2 and DYNAMIC-3, respectively.

In the second scenario we assume unbalanced load. More precisely, we consider 11 users (covered
by the shaded area in Figure 6.7(b)) which are nine times more active than the others, i.e., they
send/receive nine times more traffic.^ Unlike the previous scenario, in terms of overall throughput,
DYNAMIC-2  outperforms  STATIC  by  almost  8%,  and  BASE  by  almost  20%  (see  Figure  6.8(b)). 
This  is  because  DYNAMIC-2  is  able  to  use  some  of  the  idle  links  from  the  un-shaded  partition. 
However,  as  shown  by  the  results  for  DYNAMIC-3,  as  the  number  of  alternate  paths  increases  both 
the  overall  and  in-profile  throughputs  start  to  decrease. 

In the final scenario we consider the partition of the network shown in Figure 6.7(c). For simplicity,
we assume that only users in the same partition communicate with each other. This scenario
models  a  virtual  private  network  (VPN)  setting,  where  each  partition  corresponds  to  a  VPN.  Again, 
DYNAMIC-2  performs  best^  since  it  is  able  to  make  use  of  some  links  between  partitions  that 
otherwise  would  remain  idle. 

Finally,  we  note  that  across  all  simulations  presented  in  this  section,  the  dropping  rate  for  the 
marked  packets  was  never  larger  than  0.3%.  At  the  same  time  the  dropping  rate  for  the  unmarked 
packets  was  over  40%. 

^This  might  model  the  real  situation  where  the  east  coast  is  more  active  than  the  west  coast  between  9  and  12  a.m. 
EST. 

^The mean of the user overall throughput under DYNAMIC-2 is 15% larger than under STATIC, and 18% larger than
under BASE.

6.4.6 Summary of Simulation Results

Although  the  experiments  in  this  section  are  far  from  being  exhaustive,  we  believe  that  they  give  a 
reasonable  image  of  how  our  scheme  performs.  First,  our  scheme  is  effective  in  providing  service 
differentiation at the user level. Specifically, the first two experiments show that users with similar
communication patterns get service in proportion to their token rates. Second, at least for the
topologies  and  the  traffic  model  considered  in  these  experiments,  our  scheme  ensures  that  marked 
packets  are  delivered  to  the  destination  with  high  probability. 

Consistent with other studies [68], these experiments show that performing dynamic routing
and load balancing makes little sense when the load is already balanced. In fact, using dynamic
routing and load balancing can actually hurt, since, as noted above, this will generate longer routes
which may result in inefficient resource utilization. However, when the load is unbalanced, using
DYNAMIC-k can significantly increase the utilization and achieve a higher degree of fairness.

Finally, we note that the in-profile dropping rate decreases as the number of alternate paths
increases. For example, in the first two scenarios of the last experiment, the dropping rate is no larger
than 0.3% under STATIC and 0% under DYNAMIC-2 and DYNAMIC-3, respectively, while in the
last scenario the percentage decreases from 0.129% for STATIC, to 0.101% for DYNAMIC-2, and
to 0.054% for DYNAMIC-3.


6.5  Discussion 

In  this  chapter,  we  have  studied  a  differential  service  model,  LIRA,  in  which,  unlike  the  Assured 
service  [23,  24],  the  service  profile  is  specified  in  terms  of  resource  tokens  instead  of  absolute 
bandwidth. Since the exact bandwidth of marked bits that a customer can receive from such a
service is not known a priori, a natural question to ask is why such a service model is interesting.

There are several reasons. First, we believe that the a priori specification of an absolute amount
of bandwidth in the service profile, though desirable, is not essential. In particular, we believe that
the essential aspects that distinguish Diffserv from Intserv are the following: (a) the service profile
is used for traffic aggregates which are much coarser than per flow traffic, and (b) the service profile
is defined over a timescale larger than the duration of individual flows, i.e., the service profile is rather
static.  Notice  that  the  degree  of  traffic  aggregation  directly  relates  to  the  spatial  granularity  of  the 
service  profile.  On  the  one  hand,  if  each  service  profile  is  defined  for  only  one  destination,  we 
have  the  smallest  degree  of  traffic  aggregation.  If  there  are  N  possible  egress  nodes  for  a  user, 
N  independent  service  profiles  need  to  be  defined.  Network  provisioning  is  relatively  easy  as  the 
entire  traffic  matrix  between  all  egress  and  ingress  nodes  is  known.  However,  if  a  user  has  a  rather 
dynamic  distribution  of  egress  nodes  for  its  traffic,  i.e.,  the  amount  of  traffic  destined  to  each  egress 
node  varies  significantly,  and  the  number  of  possible  egress  nodes  is  large,  such  a  scheme  will 
significantly  reduce  the  chance  of  statistical  sharing.  On  the  other  hand,  if  each  service  profile  is 
defined  for  all  egress  nodes,  we  have  the  largest  degree  of  traffic  aggregation.  Only  one  service 
profile  is  needed  for  each  user  regardless  of  the  number  of  possible  egress  nodes.  In  addition  to  a 
smaller  number  of  service  profiles,  such  a  service  model  also  allows  all  the  traffic  from  the  same 
user,  regardless  of  its  destination,  to  statistically  share  the  same  service  profile.  The  flip  side  is  that 
it makes it difficult to provision network resources. Since the traffic matrix is not known a priori, the
best-case scenario is when the network traffic is evenly distributed, and the worst-case scenario is
when all traffic goes to the same egress router.

Therefore, it is very difficult, if not impossible, to design service profiles that (1) are static, (2)
support coarse spatial granularity, (3) are defined in terms of absolute bandwidth, and at the same
time achieve (4) high service assurance and (5) high resource utilization. Since we feel that (1), (2),
(4) and (5) are the most important for differential services, we decided to give up (3).

Fundamentally,  we  want  a  service  profile  that  is  static  and  path-independent.  However,  to 
achieve  high  utilization,  we  need  to  explicitly  address  the  fact  that  congestion  is  a  local  and  dynamic 
phenomenon. Our solution is to have two levels of differentiation: (a) the user or service-profile level
differentiation, which is based on resource token arrival rate that is static and path independent, and
(b) the packet level differentiation, which is a simple priority between marked and unmarked packets
and weighted fair share among marked packets. By dynamically setting the cost of each marked
bit  as  a  function  of  the  congestion  level  of  the  path  it  traverses,  we  set  up  the  linkage  between  the 
static/path-independent  and  the  dynamic/path-dependent  components  of  the  service  model. 

A  second  reason  our  service  model  may  be  acceptable  is  that  users  may  care  more  about  the 
differential  aspect  of  the  service  than  the  guaranteed  bandwidth.  For  example,  if  user  A  pays  twice 
as  much  as  user  B,  user  A  would  expect  to  have  roughly  twice  as  much  traffic  delivered  as  user  B 
during congestion if they share the same congested links. This is exactly what we accomplish in LIRA.

A third reason a fixed-resource-token-rate, variable-bandwidth service profile may be acceptable
is  that  the  user  traffic  is  usually  bursty  over  multiple  time-scales  [27,  80,  118].  Thus,  there  is  a 
fundamental  mismatch  between  an  absolute  bandwidth  profile  and  the  bursty  nature  of  the  traffic. 

We  do  recognize  the  fact  that  it  is  desirable  for  both  the  user  and  the  ISP  to  understand  the 
relationship  between  the  user's  resource  token  rate  and  its  expected  capacity.  This  can  be  achieved 
by  measuring  the  rate  of  marked  bits  given  a  fixed  token  rate.  Both  the  user  and  the  ISP  can  perform 
this  measurement.  In  fact,  this  suggests  two  possible  scenarios  in  which  LIRA  can  be  used  to  provide 
a  differential  service  with  an  expected  capacity  defined  in  terms  of  absolute  bandwidth.  In  the  first 
scenario,  the  service  is  not  transparent.  Initially,  the  ISP  will  provide  the  user  with  the  following 
relationship 


$\mathit{expected\_capacity} = f(\mathit{token\_rate}, \mathit{traffic\_mix})$ (6.7)

based  on  its  own  prior  measurement.  The  user  will  measure  the  expected  capacity  and  then  make 
adjustments  by  asking  for  an  increase  or  a  decrease  in  its  resource  token  rate.  In  the  second  scenario, 
the  service  is  transparent.  Both  the  initial  setting  and  the  subsequent  adjustments  of  the  service 
profile  in  terms  of  token  rate  will  be  made  by  the  ISP  only. 

Therefore,  one  way  of  thinking  about  our  scheme  is  that  it  provides  a  flexible  and  efficient 
framework  for  implementing  a  variety  of  Assured  Services.  In  addition,  the  dynamic  link  cost 
information  and  the  statistics  of  the  resource  token  bucket  history  provide  good  feedback  both  for 
individual  applications  to  perform  runtime  adaptation,  and  for  the  user  or  the  ISP  to  do  proper 
accounting  and  provisioning. 
6.6 Related Work


The  LIRA  service  is  highly  influenced  by  Clark  and  Wroclawski’s  Assured  service  proposal  [23, 
24].  The  key  difference  is  that  we  deflne  service  profiles  in  units  of  resource  tokens  rather  than 
absolute  bandwidth.  In  addition,  we  propose  a  resource  accounting  scheme  and  an  integrated  set  of 
algorithms  to  implement  our  service  model. 

Another  related  proposal  is  the  User-Share  Differentiation  (USD)  [116]  scheme,  which  does  not 
assume  absolute  bandwidth  profiles.  In  fact,  with  USD,  a  user  is  assigned  a  share  rather  than  a 
token-bucket-based  service  profile.  For  each  congested  link  in  the  network  traversed  by  the  user’s 
traffic, the user shares the bandwidth with other users in proportion to its share. The service provided
is equivalent to one in which each link in a network implements a weighted fair queueing scheduler
where the weight is the user’s share. With USD, there is little correlation between the share of a user
and the aggregate throughput it will receive. For example, two users that are assigned the same share
can see drastically different aggregate throughputs. A user that has traffic for many destinations
(thus traversing many different paths) can potentially receive much higher aggregate throughput than
a  user  that  has  traffic  for  only  a  few  destinations. 

Waldspurger  and  Weihl  have  proposed  a  framework  for  resource  management  based  on  lottery 
tickets [113, 114]. Each client is assigned a certain number of tickets which encapsulate its resource
rights. The number of tickets a user receives is similar to the user’s income rate in LIRA. This
framework  was  shown  to  provide  flexible  management  for  various  single  resources,  such  as  disk, 
memory  and  CPU.  However,  they  do  not  give  any  algorithm(s)  to  coordinate  ticket  allocation  among 
multiple  resources. 

To  increase  resource  utilization,  in  this  chapter  we  propose  performing  dynamic  routing  and  load 
balancing  among  the  best  k  shortest  paths  between  source  and  destination.  In  this  context,  one  of  the 
first  dynamic  routing  algorithms,  which  uses  the  link  delay  as  metric,  was  the  ARPANET  shortest 
path  first  [71].  Unfortunately,  the  sensitivity  of  this  metric  when  the  link  utilization  approaches 
unity resulted in relatively poor performance. Various routing algorithms based on congestion
control  information  were  proposed  elsewhere  [43, 46].  The  unique  aspect  of  our  algorithm  is  that  it 
combines  dynamic  routing,  congestion  control  and  load  balancing.  We  also  alleviate  the  problem  of 
system  stability  which  plagued  many  of  the  previous  dynamic  routing  algorithms  by  defining  a  more 
robust  cost  function  and  probabilistically  binding  a  flow  to  a  route.  We  also  note  that  our  link  cost 
is  similar  to  the  one  used  by  Ma  et  al.  [68].  In  particular,  it  can  be  shown  that  when  all  links  have 
the same capacity, our link cost is within a constant factor of the cost of the shortest-dist(P, 1) algorithm
presented by Ma et al. [68]. It is worth noting that shortest-dist(P, 1) performed the best among all the
algorithms  studied  there. 


6.7  Summary 

In this chapter we have proposed an Assured service model in which the service profile is defined
in units of resource tokens rather than absolute bandwidth, and an accounting scheme that dynamically
determines the number of resource tokens charged for each in-profile packet. We have
presented a set of algorithms that efficiently implement the service model. In particular, we introduced
three techniques: (a) distributing path costs to all edge nodes by leveraging the existing
routing infrastructure; (b) binding a flow to a route (route-pinning); (c) multi-path routing and probabilistic
binding of flows to paths to achieve load balancing.

To  implement  route-pinning,  which  is  arguably  the  most  complex  technique  of  the  three,  we 
have  used  DPS.  By  using  DPS,  we  have  been  able  to  efficiently  implement  the  LIRA  service  model 
in  a  SCORE  network.  We  have  presented  simulation  results  to  demonstrate  the  effectiveness  of  the 
approach.  To  the  best  of  our  knowledge,  this  is  the  first  complete  scheme  that  explicitly  addresses 
the  issue  of  large  spatial  granularities. 
Chapter 7


Making  SCORE  more  Robust  and  Scalable 


While  SCORE/DPS  based  solutions  are  much  more  scalable  and,  in  the  case  of  fail-stop  failures, 
more robust than their stateful counterparts, they are less scalable and robust than the stateless
solutions.  The  scalability  of  the  SCORE  architecture  suffers  from  the  fact  that  the  network  core 
cannot  transcend  trust  boundaries,  such  as  the  boundary  between  two  competing  Internet  Service 
Providers  (ISPs).  As  a  result,  the  high-speed  routers  on  these  boundaries  must  be  stateful  edge 
routers.  The  lack  of  robustness  is  because  the  malfunctioning  of  a  single  edge  or  core  router  that 
inserts erroneous state in the packet headers could severely impact the performance of an entire
SCORE  network. 

In  this  chapter,  we  discuss  an  extension  to  the  SCORE  architecture,  called  “verify-and-protect”, 
that  overcomes  these  limitations.  We  achieve  scalability  by  pushing  the  complexity  all  the  way  to 
end-hosts,  and  therefore  eliminate  the  distinction  between  core  and  edge  routers.  To  address  the 
trust  and  robustness  issues,  all  routers  statistically  verify  that  the  incoming  packets  are  correctly 
marked, i.e., that they carry consistent state. This approach enables routers to discover and isolate
misbehaving  end-hosts  and  routers.  While  this  approach  requires  routers  to  maintain  state  for  each 
flow  that  is  verified,  in  practice,  this  does  not  compromise  the  scalability  of  core  routers  as  the 
amount  of  state  maintained  by  these  routers  is  very  small.  In  practice,  as  discussed  in  Section  7.3, 
the  number  of  flows  that  a  router  needs  to  verify  simultaneously  -  flows  for  which  the  router  has  to 
maintain  state  -  is  on  the  order  of  tens.  We  illustrate  the  “verify-and-protect”  approach  in  the  context 
of  Core-Stateless  Fair  Queueing  (CSFQ),  by  developing  tests  to  accurately  identify  misbehaving 
nodes,  and  present  simulation  results  to  demonstrate  the  effectiveness  of  this  approach. 

The  remainder  of  this  chapter  is  organized  as  follows.  The  next  section  describes  the  failure 
model assumed throughout this chapter. Section 7.2 presents the components of the “verify-and-protect”
approach, while Section 7.3 describes the details of the flow verification algorithm in the
case of CSFQ. Section 7.4 proposes a robust test to identify the misbehaving nodes. Finally, Section 7.5
presents simulation results, while Section 7.6 summarizes our findings.


7.1  Failure  Model 

In this chapter, we assume a partial failure model in which a router or end-host misbehaves by
sending packets carrying inconsistent information. A packet is said to carry inconsistent information
(or state), if this information does not correctly reflect the flow behavior. In particular, with CSFQ,
a packet is said to carry inconsistent information if the difference between the estimated rate in its
header and the actual flow rate exceeds some predefined threshold (see Section 7.3.2 for details).
We use a range test, instead of an equality test, to account for the rate estimation inaccuracies due to
the delay jitter and the probabilistic dropping scheme employed by CSFQ. A node that changes the
DPS state carried by a packet from consistent into inconsistent is said to misbehave. In this chapter,
we use the term node for both a router and an end-host.

Figure 7.1: Three flows arriving at a CSFQ router: flow 1 is consistent, flow 2 is downward-inconsistent,
and flow 3 is upward-inconsistent.

A  misbehaving  node  can  affect  a  subset  or  all  flows  that  traverse  the  node.  As  an  example, 
an end-host or an egress router of an ISP may intentionally modify state information carried by
the  packets  of  a  subset  of  flows  hoping  that  these  flows  will  get  a  better  treatment  while  traversing 
a  down-stream  ISP.  In  contrast,  a  router  that  experiences  a  malfunction  may  affect  all  flows  by 
randomly  dropping  their  packets. 

A flow whose packets carry inconsistent information is called inconsistent; otherwise it is called
consistent. We differentiate between two types of inconsistent flows. If the packets of a flow carry
a rate smaller than the actual flow rate, we say that the flow is downward-inconsistent. Similarly,
if the packets carry a rate that is larger than the actual flow rate we say that the flow is upward-inconsistent.
Figure 7.1 shows an example involving three flows arriving at a CSFQ core router:
flow 1 is consistent, flow 2 is downward-inconsistent, as its arrival rate is 10, but its packets carry
an estimated rate of only 5, and flow 3 is upward-inconsistent since it has an arrival rate of 3, but
its packets carry an estimated rate of 5. As we will show in the next section, of the two types of
inconsistent  flows,  the  downward-inconsistent  ones  are  more  dangerous  as  they  can  steal  bandwidth 
from  the  consistent  flows.  In  contrast,  upward-inconsistent  flows  can  only  hurt  themselves. 

In summary, we assume only node failures that result in forwarding packets with inconsistent
state. We do not consider general failures such as a node overwriting a packet's IP header, e.g., spoofing
the IP destination and/or source addresses, or dropping all packets of a flow.

7.1.1  Example 

In  this  section,  we  first  illustrate  the  impact  of  an  inconsistent  flow  on  other  consistent  flows  that 
share  the  same  link.  In  particular,  we  show  that  a  downward-inconsistent  flow  may  deny  the  service 
to  consistent  flows.  Then  we  illustrate  the  impact  that  a  misbehaving  router  can  have  on  the  traffic 
in  the  entire  domain. 

Consider a basic scenario in which three flows with rates of 8, 6, and 2 Mbps, respectively, share
a 10 Mbps link. According to Eq. 4.1, the fair rate in this case is 4 Mbps.¹ As a result, the first two
flows get 4 Mbps each, while flow 3 gets exactly 2 Mbps.

¹This is obtained by solving the equation: $\min(\alpha, 8) + \min(\alpha, 6) + \min(\alpha, 2) = 10$.
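
The fair rate in this example can be reproduced with a short water-filling computation. The sketch below is only illustrative: the function name `fair_rate` is an assumption introduced here, and it solves the footnote equation directly rather than using the estimation algorithm that a CSFQ router runs.

```python
def fair_rate(rates, capacity):
    """Return alpha such that sum(min(alpha, r) for r in rates) == capacity.

    `rates` must be non-empty. If the link is not congested, every flow is
    served in full and the largest rate is returned.
    """
    if sum(rates) <= capacity:
        return max(rates)
    remaining = capacity
    pending = sorted(rates)
    while pending:
        share = remaining / len(pending)
        if pending[0] <= share:
            # The smallest flow fits under an equal share: serve it fully
            # and redistribute what is left among the larger flows.
            remaining -= pending.pop(0)
        else:
            return share
    return remaining  # not reached when sum(rates) > capacity


rates = [8, 6, 2]                      # Mbps
alpha = fair_rate(rates, 10)           # 10 Mbps link
allocation = [min(alpha, r) for r in rates]
print(alpha, allocation)
```

For the three flows above this gives a fair rate of 4 Mbps and allocations of 4, 4, and 2 Mbps.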
Figure 7.2: (a) A CSFQ core router cannot differentiate between an inconsistent flow with an arrival rate of
8,  whose  packets  carry  an  estimated  rate  of  1,  and  8  consistent  flows,  each  having  an  arrival  rate  of  1.  (b) 
Since  CSFQ  assumes  implicitly  that  all  flows  are  consistent  it  will  allocate  a  rate  of  8  to  the  inconsistent  flow, 
and  a  rate  of  1  to  consistent  flows.  The  crosses  indicate  dropped  packets. 

Next,  assume  that  the  first  flow  is  downward-inconsistent.  In  particular,  its  packets  carry  an 
estimated  rate  of  1  Mbps,  instead  of  8  Mbps.  It  is  easy  to  see  then  that  such  a  scenario  will  break 
CSFQ.  Intuitively,  this  is  because  a  core  router  cannot  differentiate  -  based  only  on  the  information 
carried  by  the  packets  -  between  the  case  of  an  8  Mbps  inconsistent  flow,  and  the  case  of  8  consistent 
flows  sending  at  1  Mbps  each  (see  Figure  7.2(a)).  In  fact,  CSFQ  will  assume  by  default  that  the 
information  carried  by  all  packets  is  consistent,  and,  as  a  result,  will  compute  a  fair  rate  of  1  Mbps 
(see Figure 7.2(b)).² Thus, while the other two flows get 1 Mbps each, the inconsistent flow will get
8  Mbps! 
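
The effect of the inconsistent label can be checked numerically. The sketch below assumes the CSFQ forwarding rule in which a packet carrying label $\ell$ is forwarded with probability $\min(1, \alpha/\ell)$, and finds by bisection the fair rate $\alpha$ at which the forwarded traffic fills the link. The names `forwarded_rate` and `csfq_fair_rate` are hypothetical, and a real router estimates $\alpha$ iteratively rather than solving for it exactly.

```python
def forwarded_rate(arrival, label, alpha):
    """Rate a flow receives when each packet is forwarded w.p. min(1, alpha/label)."""
    return arrival * min(1.0, alpha / label)


def csfq_fair_rate(flows, capacity, iterations=60):
    """Find alpha such that the total forwarded rate equals the capacity.

    `flows` is a list of (arrival_rate, label) pairs. A consistent flow has
    label == arrival_rate; a downward-inconsistent flow has label < arrival_rate.
    """
    max_label = max(label for _, label in flows)
    if sum(arrival for arrival, _ in flows) <= capacity:
        return max_label  # link not congested: nothing needs to be dropped
    lo, hi = 0.0, max_label
    for _ in range(iterations):
        mid = (lo + hi) / 2
        load = sum(forwarded_rate(a, lab, mid) for a, lab in flows)
        if load > capacity:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


flows = [(8, 1), (6, 6), (2, 2)]       # flow 1 sends 8 Mbps but is labeled 1 Mbps
alpha = csfq_fair_rate(flows, 10)
shares = [forwarded_rate(a, lab, alpha) for a, lab in flows]
print(round(alpha, 6), [round(s, 6) for s in shares])
```

Under these assumptions the computed fair rate is about 1 Mbps and the three flows receive about 8, 1, and 1 Mbps, matching the outcome described above.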

Worse  yet,  a  misbehaving  router  can  affect  not  only  the  traffic  it  forwards,  but  also  the  traffic 
of  other  down-stream  routers.  Consider  the  example  in  Figure  7.3(a)  in  which  the  black  router  on 
the  path  of  flow  1  misbehaves  by  under-estimating  the  rate  of  flow  1.  As  illustrated  by  the  previous 
example,  this  will  cause  down-stream  routers  to  unfairly  allocate  more  bandwidth  to  flow  1,  hurting 
in  this  way  the  consistent  traffic.  In  this  example,  flow  1  will  affect  both  flows  2  and  3.  In  contrast, 
in  a  stateful  network,  in  which  each  router  implements  Fair  Queueing,  a  misbehaving  router  can 
hurt  only  the  flows  it  forwards.  In  the  scenario  shown  in  Figure  7.3(b),  the  misbehaving  router  will 
affect  only  flow  1,  while  the  other  two  flows  will  not  be  affected. 
7.2 The “Verify-and-Protect” Approach

To address the robustness and improve the scalability of the SCORE architecture we consider a
“verify-and-protect” extension of this architecture. We achieve scalability by pushing the complexity
all the way to end-hosts and eliminating the core-edge distinction. To address the
trust and robustness issues, all routers statistically verify that the incoming packets are correctly
marked. This approach enables routers to discover and isolate misbehaving end-hosts and routers.

²This is obtained by solving the equation: $\min(2, \alpha) + \min(6, \alpha) + 8 \times \min(1, \alpha) = 10$.

Figure 7.3: (a) An example illustrating how a misbehaving router (represented by the black box) can affect
the down-stream consistent traffic in the case of CSFQ. In particular, the misbehaving router will affect flow
1, which in turn affects flows 2 and 3 as they share the same down-stream links with flow 1. (b) In the case
of Fair Queueing the misbehaving router will affect only flow 1; the other two flows are not affected.

The “verify-and-protect” extension consists of three components: (1) identification of the misbehaving
nodes, (2) protection of the consistent traffic against the inconsistent traffic forwarded by
the  misbehaving  node,  and  (3)  recovery  from  the  protection  mode  if  the  misbehaving  node  heals. 
Next,  we  briefly  discuss  these  components  in  more  detail. 

7.2.1  Node  Identification 

Node  identification  builds  on  the  fact  that  with  DPS  a  core  router  can  easily  verify  whether  a  flow  is 
inconsistent or not. This can be simply done by having a router (1) monitor a flow, (2) reconstruct
its state, and then (3) check whether the reconstructed state matches the state carried by the flow’s
packets. We call this procedure flow verification. In the case of CSFQ, flow verification consists
of re-estimating a flow’s rate and then comparing it against the estimated rate carried by the flow’s
packets.  If  the  two  rates  are  within  some  predefined  distance  from  each  other  (see  Section  7.3.2)  we 
say  that  the  flow  is  consistent;  otherwise,  we  say  that  the  flow  is  inconsistent. 

Since in our case a misbehaving node is defined as a node which forwards packets carrying
inconsistent state, a simple identification algorithm would be to have each router monitor the incoming
flows. Then, if the router detects an inconsistent flow, it will conclude that the up-stream
node misbehaves.³

The drawback of this approach is that it requires core routers to monitor each incoming flow,
which will compromise the scalability of our architecture. To get around this problem, we limit the
number of flows that are monitored to a small sub-set of all arriving flows. While this approach does
not guarantee that every inconsistent flow is identified, it is still effective in detecting misbehaving
nodes. This is because, at the limit, identifying one inconsistent flow is enough to conclude that an
up-stream node misbehaves. However, note that in this case we can no longer be certain that the
up-stream neighbor misbehaves; it can be the case that another up-stream node misbehaves but the
intermediate nodes fail to identify it.

³For simplicity, here we assume that we can decide that a node misbehaves based on a single inconsistent flow.
However, as we will show in Section 7.3.2, in the case of CSFQ we have to perform more than one flow test to accurately
identify a misbehaving node.

7.2.2  Protection 

Once  a  router  identifies  a  misbehaving  flow,  the  next  step  is  to  protect  the  consistent  traffic  against 
this flow. One approach would be to penalize the inconsistent flows only. The problem with this approach
is that it is necessary to maintain state for all inconsistent flows. If the number of inconsistent
flows  is  large,  this  approach  will  compromise  the  scalability  of  the  core  routers. 

A second approach would be to penalize all flows which arrive from a misbehaving node. In
particular,  once  a  router  concludes  that  an  up-stream  node  misbehaves,  it  penalizes  all  flows  that  are 
coming  from  that  node,  no  matter  whether  they  are  inconsistent  or  not.  While  this  approach  may 
seem  overly  conservative,  it  is  consistent  with  our  failure  model,  which  considers  only  node,  not 
flow,  failure. 

Finally,  a  third  approach  would  be  to  announce  the  failure  at  a  higher  administrative  level  — 
for  example,  to  the  network  administrator  —  that  can  then  take  the  appropriate  action.  At  the  limit, 
the  network  administrator  can  simply  shut-down  the  misbehaving  router  and  reroute  the  traffic.  A 
variation  of  this  scheme  would  be  to  design  a  routing  protocol  that  automatically  reroutes  the  traffic 
when  a  misbehaving  router  is  identified. 

In  the  example  studied  in  this  chapter,  i.e.,  in  the  case  of  CSFQ,  we  assume  the  second  approach 
(see  Section  7.5). 

7.2.3  Recovery 

In  many  cases  the  failure  of  a  node  can  be  transient,  i.e.,  after  forwarding  misbehaving  traffic  for  a 
certain  time,  a  node  may  stop  doing  so.  In  this  case,  the  down-stream  node  should  detect  this,  and 
stop  punishing  the  traffic  arriving  from  that  node.  Again,  this  can  be  easily  implemented  by  using 
flow  verification.  If  a  router  does  not  detect  any  ill-behaved  flow  for  a  predefined  period  of  time,  it 
can  decide  then  that  the  up-stream  node  no  longer  misbehaves,  and  stop  punishing  its  flows. 
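
The protection and recovery rules can be combined into a small per-neighbor state machine. The sketch below is a simplified illustration under stated assumptions: the class name `UpstreamGuard`, the `recovery_period` parameter, and the decision to enter protection mode after a single failed flow test are all hypothetical choices made here, not details taken from the text.

```python
class UpstreamGuard:
    """Tracks whether traffic from one up-stream node is being penalized."""

    def __init__(self, recovery_period):
        self.recovery_period = recovery_period  # quiet time needed to recover
        self.penalizing = False
        self.last_inconsistent = None           # time of the latest failed test

    def record_test(self, now, flow_is_consistent):
        """Feed the outcome of one flow verification that finished at `now`.

        Returns True while flows from the up-stream node should be penalized.
        """
        if not flow_is_consistent:
            # Protection: any inconsistent flow puts the node in penalty mode.
            self.penalizing = True
            self.last_inconsistent = now
        elif self.penalizing and now - self.last_inconsistent >= self.recovery_period:
            # Recovery: no ill-behaved flow seen for the whole period.
            self.penalizing = False
        return self.penalizing


guard = UpstreamGuard(recovery_period=10)
for t, ok in [(0, True), (1, False), (5, True), (11, True)]:
    print(t, guard.record_test(t, ok))
```

In this usage example the node is penalized from time 1 and released at time 11, once a full recovery period has passed without a failed test.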


7.3  Flow  Verification 

At  the  basis  of  the  “verify-and-protect”  approach  lies  the  ability  to  identify  misbehaving  nodes.  In 
turn,  this  builds  on  the  ability  to  perform  flow  verification  to  detect  whether  a  flow  is  consistent  or 
not. In this section we describe the flow verification algorithm, and propose a test to check a flow’s
consistency in the case of CSFQ.

We  assume  that  a  flow  is  uniquely  identified  by  its  source  and  destination  IP  addresses.  This 
makes  packet  classification  easy  to  implement  at  very  high  speeds.  For  most  practical  purposes,  we 
can  use  a  simple  hash  table  with  the  hash  keys  computed  over  the  source  and  destination  address 
fields  in  the  packet’s  IP  header. 
upon packet p arrival:
  // if the flow to which p belongs is not monitored,
  // and the monitoring list is not full, start to monitor it
  f = get_flow_filter(p);
  if (f ∉ monitoring_list)
    if (size(monitoring_list) < M)
      // start to monitor f
      insert(monitoring_list, f);
      init(f.state);
      f.start_time = f.crt_time;
  else // update f's state
    update(f.state, p);
    if (f.crt_time - f.start_time > T_mon)
      flow_id_test(f.state, p);
      delete(monitoring_list, f);


Figure  7.4:  The  pseudocode  of  the  flow  verification  algorithm. 
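
The pseudocode in Figure 7.4 can be rendered as a small runnable sketch. Several details below are assumptions made for illustration: the class name `FlowMonitor`, the use of a plain byte count divided by elapsed time as the rate estimate (CSFQ itself uses a different estimator), and the `flow_id_test` callback signature. Flows are keyed by the source and destination address pair, as described above.

```python
class FlowMonitor:
    """Monitors at most `max_flows` flows at one input, each for `t_mon` time units."""

    def __init__(self, max_flows, t_mon, flow_id_test):
        self.max_flows = max_flows        # M in the text
        self.t_mon = t_mon                # monitoring interval
        self.flow_id_test = flow_id_test  # callback(est_rate, label) -> True if consistent
        self.monitored = {}               # (src, dst) -> per-flow state
        self.results = []                 # (flow key, verdict) pairs

    def on_packet(self, now, src, dst, length, label):
        key = (src, dst)
        state = self.monitored.get(key)
        if state is None:
            if len(self.monitored) < self.max_flows:
                # Start monitoring; as in the pseudocode, the packet that
                # triggers monitoring is not counted.
                self.monitored[key] = {"start": now, "bytes": 0}
            return
        state["bytes"] += length
        elapsed = now - state["start"]
        if elapsed > self.t_mon:
            est_rate = state["bytes"] / elapsed
            self.results.append((key, self.flow_id_test(est_rate, label)))
            del self.monitored[key]


def within_20_percent(est, label):
    return abs(est - label) / label <= 0.2


mon = FlowMonitor(max_flows=1, t_mon=1000, flow_id_test=within_20_percent)
for t in range(0, 1200, 100):                # one 100-byte packet every 100 ms
    mon.on_packet(t, "a", "b", 100, 1.0)     # label: 1 byte per ms
print(mon.results)
```

The monitor keeps state only for the flows in `monitored`, so its memory use is bounded by `max_flows` regardless of how many flows traverse the input.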


We  consider  a  router  architecture  in  which  the  monitoring  function  is  implemented  at  the  input 
ports.  Performing  flow  monitoring  at  inputs,  rather  than  outputs,  allows  us  to  detect  inconsistent 
flows  as  soon  as  possible,  and  therefore  limit  the  impact  that  these  flows  might  have  on  the  consistent 
flows  traversing  the  router.  Without  loss  of  generality  we  assume  an  on-line  verification  algorithm. 
The  pseudocode  of  the  algorithm  is  shown  in  Figure  7.4.  We  assume  that  an  input  can  monitor 
up  to  M  flows  simultaneously.  Upon  a  packet  arrival,  we  first  check  whether  the  flow  it  belongs 
to is already monitored. If not, and if fewer than $M$ flows are monitored, we add the flow to the
monitoring list. A flow is monitored for an interval of length $T_{mon}$. At the end of this interval the
router computes an estimate of the flow rate and compares it against the rate carried by the flow's
packets.  Based  on  this  comparison,  the  router  decides  whether  the  flow  is  consistent  or  not.  We  call 
this  test  the  flow  identification  test. 

A  flow  that  fails  this  test  is  classified  as  inconsistent.  In  an  ideal  fluid  flow  system  a  flow  would 
be  classified  as  consistent  if  the  rate  estimated  at  the  end  of  the  monitoring  interval  is  equal  to 
the rate carried by the flow’s packets. Unfortunately, in a real system such a simple test will fail
due to inaccuracies introduced by (1) the rate estimation algorithm, (2) the delay jitter, and (3) the
probabilistic  buffer  management  scheme  employed  by  CSFQ.  In  Section  7.3.2  we  present  a  flow 
identification  test  that  is  robust  in  the  presence  of  these  inaccuracies. 

One important question is whether the need to maintain state for each flow that is verified
compromises the scalability of our approach. We answer this question next. Let $T_{mon}$ be the
average time it takes a router to verify a flow (see Table 7.1). Since according to the algorithm
in Figure 7.4, a new flow (to be verified) is selected by choosing a random packet, and a router
can verify up to $M$ flows simultaneously, the expected time to select an inconsistent flow is about
$T_{mon} / (M f_{inc})$, where $f_{inc}$ represents the fraction of the inconsistent traffic (see Table 7.1). Thus,
the expected time to eventually catch an inconsistent flow is $T_{mon} + T_{mon} / (M f_{inc})$. As a result,
it makes little sense to choose $M$ much larger than $1 / f_{inc}$, as this will only marginally reduce the
time to catch a flow. In fact, if we choose $M \approx 1 / f_{inc}$, the expected time to eventually catch an
inconsistent flow is within a factor of two of the optimal value $T_{mon}$. Next, it is important to note
that in practice we can ignore the inconsistent traffic when $f_{inc}$ is small. Indeed, in the worst case,
ignoring the inconsistent traffic is equivalent to “losing” only a fraction $f_{inc}$ of the link capacity to
the inconsistent traffic. For example, if $f_{inc} = 1\%$, even if we ignore the inconsistent traffic, the
consistent traffic will still receive about 99% of the link capacity. Given the various approximations
in our algorithms, ignoring the inconsistent traffic when $f_{inc}$ is on the order of a few percent is
acceptable in practice. As a result, we expect the number of flows that are simultaneously
monitored, $M$, to be on the order of tens. Note that $M$ does not depend on the number of flows that
traverse a router, a number that can be much larger, i.e., on the order of hundreds of thousands or
even millions.

Notation                                    Comments
$M$                                         maximum number of flows simultaneously monitored at an input port
$T_{mon}$                                   monitoring interval
$T_{inc}$                                   expected time to identify a flow as inconsistent
$n_{inc}$                                   expected number of tests to classify a flow as inconsistent
$k_{ov}$                                    overflow factor - ratio between labels carried by flow's packets at the
                                            entrance of the network and the fair rate on the upstream bottleneck link
$k_{inc}$                                   inconsistency factor - ratio between flow's arrival rate and labels
                                            carried by flow's packets
$m$                                         number of packets sent during $T_{mon}$ at the fair rate
                                            (assuming fixed packet sizes)
$S_c$                                       event that a tested flow is consistent
$S_{inc}$                                   event that a tested flow is inconsistent; $\Pr(S_{inc}) + \Pr(S_c) = 1$
$C_c$                                       event that a tested flow is classified as consistent
$C_{inc}$                                   event that a tested flow is classified as inconsistent
$p_{c-inc} = \Pr(C_{inc} \mid S_c)$         probability that a consistent flow is misidentified
$p_{inc-inc} = \Pr(C_{inc} \mid S_{inc})$   probability that an inconsistent flow is identified
$p_a$                                       probability that a selected flow is active long enough to be tested
$p_{inc} = p_a \Pr(C_{inc})$                probability that a selected flow is classified as inconsistent
$p_{id} = \Pr(S_{inc} \mid C_{inc})$        probability to identify an inconsistent flow, i.e., probability that
                                            a flow classified as inconsistent is indeed inconsistent
$f_{inc}$                                   fraction of inconsistent traffic
$\Pr(S_{inc})$                              probability to select an inconsistent flow; we assume $\Pr(S_{inc}) = f_{inc}$

Table 7.1: Notations used throughout this chapter. For simplicity, the notations do not include the time
argument t.
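
The diminishing return from a larger monitoring list can be seen with a few numbers. The sketch below assumes that the time to catch an inconsistent flow is the selection time $T_{mon} / (M f_{inc})$ plus one monitoring interval $T_{mon}$, which is consistent with the factor-of-two remark in this section; the function name `expected_catch_time` and the sample value $f_{inc} = 5\%$ are illustrative choices.

```python
def expected_catch_time(t_mon, m_slots, f_inc):
    """Selection time T_mon / (M * f_inc) plus one monitoring interval T_mon."""
    return t_mon + t_mon / (m_slots * f_inc)


T_MON = 1.0    # monitoring interval, arbitrary time unit
F_INC = 0.05   # 5% of the traffic is inconsistent, so 1 / f_inc = 20

for m_slots in (2, 20, 200, 2000):
    t = expected_catch_time(T_MON, m_slots, F_INC)
    print(f"M = {m_slots:4d}: expected catch time = {t:.2f} x T_mon")
```

Going from 2 to 20 monitored flows cuts the expected time from 11 to 2 monitoring intervals, while going from 20 to 200 only improves it from 2 to 1.1, which is why a value of $M$ on the order of $1 / f_{inc}$ is sufficient.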

7.3.1  Bufferless  Packet  System 

To  facilitate  the  discussion  of  the  flow  identification  test,  we  consider  a  simplified  buffer-less  packet 
system,  and  ignore  the  inaccuracies  due  to  the  rate  estimation  algorithm.  For  simplicity,  assume  that 
each  source  sends  constant-bit  rate  traffic,  and  that  all  packets  have  the  same  length.  We  consider 
only  end-host  and/or  router  misbehaviors  that  result  in  having  all  packets  of  a  flow  carry  a  label  that 
is $k_{inc}$ times smaller than the actual flow rate, where $k_{inc} \neq 1$. Thus, we assume that $k_{inc}$, also
called the inconsistency factor, does not change⁴ during the life of a flow. In Figure 7.1, flow 2 has
$k_{inc} = 2$, while flow 3 has $k_{inc} = 3/5$.

We consider a congested link between two routers $N_1$ and $N_2$, and denote it by $(N_1:N_2)$. We
assume that $N_1$ forwards traffic to $N_2$, and that $N_2$ monitors it. Let $\alpha$ be the fair rate of $(N_1:N_2)$.
Then, with each flow that arrives at $N_1$ and which is forwarded to $N_2$ along $(N_1:N_2)$ we associate
an overflow factor, denoted $k_{ov}$, that represents the ratio between the labels carried by the flow’s
packets and the fair rate $\alpha$ along $(N_1:N_2)$. For example, in Figure 7.1 each flow has $k_{ov} = 5/3$.

7.3.2 Flow Identification Test

In  this  section  we  present  the  flow  identification  test  for  CSFQ  in  the  bufferless  model.  First,  we 
show  why  designing  such  a  test  is  difficult.  In  particular,  we  demonstrate  that  the  probabilistic 
dropping  scheme  employed  by  CSFQ  can  significantly  affect  the  accuracy  of  the  test.  Then  we  give 
three  desirable  goals  for  the  identification  test,  and  discuss  the  trade-offs  to  achieve  these  goals. 

Our flow identification test makes the decision based on the relative discrepancy between the
rate of the flow estimated by the router, denoted $\hat{r}$, and the labels carried by the flow’s packets,
denoted $\tilde{r}$. More precisely, the relative discrepancy is defined as

$$\mathit{dis}_{rel} = \frac{\hat{r} - \tilde{r}}{\tilde{r}} \tag{7.1}$$

⁴The reason for this assumption is that a constant value of $k_{inc}$ maximizes the excess service received by an inconsistent
flow without increasing the probability of the flow to be caught. We show this in Section 7.4.1.

Figure 7.5: The probability density function (p.d.f.) of the relative discrepancy of estimating the flow rate
for different values of $k_{ov}$.


In an idealized fluid flow system the flow identification test needs only to see whether the relative
discrepancy is zero or not. Unfortunately, even in our simplified buffer-less model this test is not
good enough, primarily due to the inaccuracies introduced by the CSFQ probabilistic dropping
scheme. Even if a flow is consistent and it traverses only well-behaved nodes, it can still end up
with a non-zero discrepancy. To illustrate this point consider a consistent flow with the overflow
factor $k_{ov} > 1$ that arrives at $N_1$. Let $r$ be the arrival rate of the flow and let $\alpha$ be the fair rate of
$(N_1:N_2)$. Note that $k_{ov} = r / \alpha$. Assume that exactly $n$ packets of the flow arrive at $N_1$ during a
monitoring interval $T_{mon}$, and that packets are dropped independently of each other. Then,
the probability that $N_1$ will forward exactly $x$ packets during $T_{mon}$ is:


Pfwd{n\x) 


(7.2) 


where  p  =  a/ r  =  I /kov  represents  the  probability  to  forward  a  packet. 
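As a minimal sketch of Eq. (7.2), the forwarding probability can be evaluated directly as a binomial term. The function name `p_fwd` and the example values (13 arriving packets, overflow factor 1.3) are chosen here for illustration only.

```python
from math import comb


def p_fwd(n: int, x: int, k_ov: float) -> float:
    """Probability that exactly x of n arriving packets are forwarded (Eq. 7.2)."""
    # Forwarding probability p = alpha / r = 1 / k_ov, capped at 1 for k_ov <= 1.
    p = min(1.0, 1.0 / k_ov)
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


# Illustrative values: n = 13 arrivals with k_ov = 1.3.
pmf = [p_fwd(13, x, 1.3) for x in range(14)]
mean_forwarded = sum(x * q for x, q in enumerate(pmf))
```

With these values the mean number of forwarded packets is n / k_ov, i.e., the number of packets that fit in the monitoring interval at the fair rate.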

Next, assume that $N_2$ monitors this flow. Since we ignore the delay jitter, the probability that
$N_2$ receives exactly $x$ packets during $T_{mon}$ is exactly the probability that $N_1$ will
forward $x$ packets during $T_{mon}$, i.e., $p_{fwd}(n; x)$. As a result, $N_2$ estimates the flow
rate as $\hat{r} = x l / T_{mon}$ with probability $p_{fwd}(n; x)$, where $l$ is the packet length.
Let $m$ denote the number of packets that are forwarded at the fair rate $\alpha$, i.e.,
$m = \alpha T_{mon} / l$. The relative discrepancy of the flow measured by $N_2$ during $T_{mon}$
is then

$$dis_{rel} = \frac{\hat{r} - r}{r} = \frac{\hat{r} - \alpha}{\alpha} = \frac{x - m}{m}, \tag{7.3}$$

with probability $p_{fwd}(n; x)$.

Figure 7.5 depicts the probability density function (p.d.f.) of $dis_{rel}$ for $m = 10$ and
different values of $k_{ov}$. Thus, even if the flow is consistent, its relative discrepancy as
measured by $N_2$ can be significant. This suggests a range-based test to reduce the probability of
false positives, i.e., the probability that a consistent flow will be classified as inconsistent.
In particular, we propose the following flow identification test:

Flow Identification Test (CSFQ) Define two thresholds: $H_l < 0$ and $H_u > 0$. Then
we say that a flow is inconsistent if its relative discrepancy ($dis_{rel}$) is either smaller than
$H_l$ or larger than $H_u$.
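The test above can be sketched as a small predicate. This is only an illustration: the function name, the argument names, and the lower threshold of -0.2 used in the usage lines are assumptions, not values taken from the text.

```python
def is_inconsistent(r_est: float, label: float, h_l: float, h_u: float) -> bool:
    """Range-based flow identification test on the relative discrepancy."""
    # dis_rel compares the router's rate estimate with the rate carried in the labels.
    dis_rel = (r_est - label) / label
    return dis_rel < h_l or dis_rel > h_u


# Hypothetical thresholds H_l = -0.2 and H_u = 0.2.
flagged_high = is_inconsistent(r_est=1.5, label=1.0, h_l=-0.2, h_u=0.2)
flagged_ok = is_inconsistent(r_est=1.1, label=1.0, h_l=-0.2, h_u=0.2)
```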

As shown in Figure 7.5, the overflow factor ($k_{ov}$) has a significant impact on the flow's
relative discrepancy. The higher the overflow factor of a flow is (i.e., the more aggressive a flow
is), the more spread out its relative discrepancy is. A spread-out relative discrepancy makes it more
difficult to accurately identify an inconsistent flow. Assume $H_u = 0.5$, that is, a flow will be
classified as inconsistent whenever the measured relative discrepancy ($dis_{rel}$) exceeds 0.5. As
shown in Figure 7.5, in this case the probability that a consistent flow will be misidentified
increases significantly with $k_{ov}$. If $k_{ov} \leq 1.05$, this probability is virtually 0, while
if $k_{ov} = 16$, this probability is about 0.03, which in practice can be unacceptable. Note that
while we can decrease this probability by increasing $H_u$, such a simple solution has a major
drawback: a large $H_u$ will allow flows with a larger inconsistency factor, i.e., with
$k_{inc} < H_u + 1$, to go by undetected.
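The misidentification probability discussed above can be approximated in the bufferless model by summing the binomial mass of Eq. (7.2) over the outcomes whose relative discrepancy exceeds the threshold. The sketch below assumes that the number of arriving packets is n = m k_ov rounded to an integer, and uses exact fractions for the threshold comparison; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def dis_rel_pmf(m: int, k_ov: Fraction) -> dict:
    """Map each possible relative discrepancy (x - m)/m to its probability."""
    n = round(m * k_ov)      # assumed number of packets arriving during T_mon
    p = float(1 / k_ov)      # forwarding probability at the upstream node
    return {
        Fraction(x - m, m): comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
    }


def misidentified_prob(m: int, k_ov: Fraction, h_u: Fraction) -> float:
    """Probability that a consistent flow shows dis_rel > h_u."""
    return sum(q for d, q in dis_rel_pmf(m, k_ov).items() if d > h_u)


aggressive = misidentified_prob(10, Fraction(16), Fraction(1, 2))
gentle = misidentified_prob(10, Fraction("1.05"), Fraction(1, 2))
```

Under these assumptions the gentle flow can never exceed the threshold, because at most about m packets arrive in the first place, while the aggressive flow is misidentified a few percent of the time.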

To simplify the problem of choosing $H_u$, we assume that the overflow factor of a consistent
flow has an upper bound. This assumption is motivated by the observation that consistent flows are
likely to react to congestion, and therefore their overflow factor will be small. Indeed, unless a
flow is malicious, it makes little sense for the source to send at a rate higher than the available
rate on the bottleneck link. As a result, we expect $k_{ov}$ to be slightly larger than 1 in the case
of a consistent flow. For this reason, in the remainder of this chapter we assume that
$k_{ov} \leq 1.3$. The value of 1.3 is chosen somewhat arbitrarily to coincide with the value used in
[101] to differentiate between well-behaved and malicious (overly aggressive) flows. Thus, if a
consistent flow is too aggressive, i.e., its overflow factor $k_{ov} > 1.3$, it will run a high risk
of being classified as inconsistent. However, we believe that this is the right tradeoff, since it
will provide an additional incentive to end-hosts to implement flow congestion control.

7.3.3  Setting  threshold 

The main question that remains to be answered is how to set the thresholds $H_u$ and $H_l$. In the
remainder of this section we give some guidelines and illustrate the trade-offs in choosing $H_u$.
Since choosing $H_l$ faces similar trade-offs, we do not discuss it here.

Setting $H_u$ is difficult because the flow identification test has to meet several conflicting goals:

1. robustness - maximize the probability that a flow identified as inconsistent is indeed
inconsistent. We denote this probability by $p_{id}$.

2. sensitivity - minimize the inconsistency factor ($k_{inc}$) for which a flow is still caught.

3. responsiveness - minimize the expected time it takes to classify a flow as inconsistent. As a
metric, we consider the expected number of tests it takes to classify a flow as inconsistent.
We denote this number by $n_{inc}$.

In the remainder of this section, we assume that router $N_1$ misbehaves by forwarding a constant
fraction of inconsistent traffic, $f_{inc}$, to $N_2$. We assume that $f_{inc}$ is constant because,
as we will show in Section 7.4.1, this represents the worst-case scenario for our flow
identification test. In particular, if a malicious router wants to maximize the excess bandwidth
received by its inconsistent traffic before being caught, then it has to send inconsistent traffic at
a constant fraction of its total traffic. Further, we assume that all consistent flows have
$k_{ov} = 1.3$, and all inconsistent flows have $k_{ov} \leq 1$, i.e., no packet of an inconsistent
flow is ever dropped by $N_1$. As a result, the relative discrepancy of an inconsistent flow will be
exactly $k_{inc} - 1$. This means that our test will not be able to catch an inconsistent flow with
$k_{inc} < H_u + 1$. Again, this choice of $k_{ov}$ represents the worst-case scenario. It can be
shown that decreasing the $k_{ov}$ of consistent flows and/or increasing the $k_{ov}$ of inconsistent
flows will only improve the test's robustness, sensitivity, and responsiveness.

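Next, we derive the two parameters that characterize the robustness and responsiveness: $p_{id}$
and $n_{inc}$. By using Bayes's formula and the notations in Table 7.1 we have

$$\begin{aligned}
p_{id} &= \Pr(S_{inc} \mid C_{inc}) \\
&= \frac{\Pr(S_{inc}) \Pr(C_{inc} \mid S_{inc})}{\Pr(S_{inc}) \Pr(C_{inc} \mid S_{inc}) + \Pr(S_{c}) \Pr(C_{inc} \mid S_{c})} \\
&= \frac{f_{inc} \times p_{inc-inc}}{f_{inc} \times p_{inc-inc} + (1 - f_{inc}) \times p_{c-inc}}.
\end{aligned} \tag{7.4}$$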
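Eq. (7.4) is easy to evaluate numerically. The following sketch uses a hypothetical function name and hypothetical inputs; in particular, the value 0.033 for the misidentification probability of a consistent flow is only an example input, not a figure quoted from the text.

```python
def p_id(f_inc: float, p_inc_inc: float, p_c_inc: float) -> float:
    """Probability that a flow classified as inconsistent really is inconsistent (Eq. 7.4)."""
    true_alarms = f_inc * p_inc_inc            # inconsistent flows that are caught
    false_alarms = (1.0 - f_inc) * p_c_inc     # consistent flows that are misidentified
    # Assumes at least one of the two terms is positive.
    return true_alarms / (true_alarms + false_alarms)


# Example inputs: 10% inconsistent traffic, every inconsistent flow caught,
# and a 3.3% chance of misidentifying a consistent flow.
example = p_id(f_inc=0.1, p_inc_inc=1.0, p_c_inc=0.033)
```

The example shows why a small fraction of inconsistent traffic hurts robustness: even a modest false-alarm rate is multiplied by the much larger population of consistent flows.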

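In addition, the expected number of flows that are tested before a flow will be classified as
inconsistent, $n_{inc}$, is

$$n_{inc} = \sum_{i=1}^{\infty} i \, (1 - p_{inc})^{i-1} \, p_{inc} = \frac{1}{p_{inc}}, \tag{7.5}$$

where $p_{inc}$ is the probability that a selected flow will be classified as inconsistent. With the
notations from Table 7.1, and using simple probability manipulations, we have

$$\begin{aligned}
p_{inc} &= p_a \Pr(C_{inc}) \\
&= p_a \left( \Pr(C_{inc} \cap S_{inc}) + \Pr(C_{inc} \cap S_{c}) \right) \\
&= p_a \left( \Pr(S_{inc}) \Pr(C_{inc} \mid S_{inc}) + \Pr(S_{c}) \Pr(C_{inc} \mid S_{c}) \right) \\
&= p_a \times \left( f_{inc} \times p_{inc-inc} + (1 - f_{inc}) \times p_{c-inc} \right).
\end{aligned} \tag{7.6}$$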

Finally, by combining Eqs. (7.5) and (7.6), we obtain

$$n_{inc} = \frac{1}{p_a \times \left( f_{inc} \times p_{inc-inc} + (1 - f_{inc}) \times p_{c-inc} \right)}. \tag{7.7}$$
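A short sketch of Eq. (7.7), together with a numerical check that the expected number of tests is the mean of a geometric distribution with success probability p_inc. The function names and the truncation length of the series are illustrative choices.

```python
def n_inc(p_a: float, f_inc: float, p_inc_inc: float, p_c_inc: float) -> float:
    """Expected number of tested flows before one is classified as inconsistent (Eq. 7.7)."""
    p_inc = p_a * (f_inc * p_inc_inc + (1.0 - f_inc) * p_c_inc)
    return float("inf") if p_inc == 0 else 1.0 / p_inc


def n_inc_series(p_inc: float, terms: int = 10_000) -> float:
    """Truncated series sum_i i (1 - p_inc)^(i-1) p_inc, for comparison with 1 / p_inc."""
    return sum(i * (1.0 - p_inc) ** (i - 1) * p_inc for i in range(1, terms + 1))


closed_form = n_inc(p_a=1.0, f_inc=0.1, p_inc_inc=1.0, p_c_inc=0.0)
by_series = n_inc_series(0.1)
```

Halving p_a doubles the result, which is the scaling by 1/p_a that follows directly from Eq. (7.7).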

As illustrated by Eqs. (7.4) and (7.7), the fraction of the inconsistent traffic, $f_{inc}$, has a
critical impact on both $p_{id}$ and $n_{inc}$. The smaller $f_{inc}$ is, the smaller $p_{id}$ is,
and the larger $n_{inc}$ is. The reason $n_{inc}$ increases when $f_{inc}$ decreases is that in any
valid flow identification test we have $p_{inc-inc} > p_{c-inc}$, i.e., the probability that an
inconsistent flow will be identified is always larger than the probability that a consistent flow
will be misidentified. In the remainder of this chapter, we choose a somewhat arbitrary
$f_{inc} = 0.1$. If $f_{inc} < 0.1$, we simply ignore the impact of the inconsistent traffic on the
consistent traffic. This decision is motivated by the fact that the inconsistent traffic can "steal"
at most a fraction $f_{inc}$ of the link capacity. In the worst case, this leads to a 10% degradation
in the bandwidth received by a consistent flow, which we consider to be acceptable. However, it
should be noted that there is nothing special about this value of $f_{inc}$. The reason for which we
use a specific value for $f_{inc}$ is to make the procedure of choosing $H_u$ more concrete.

Figure 7.6: (a) The probability to identify an inconsistent flow, $p_{id}$, and (b) the expected
number of tests it takes to classify a flow as inconsistent, $n_{inc}$, as functions of $H_u$. (The
values of $n_{inc}$ for $k_{inc} \leq 1.25$ and $H_u = 0.3$ are not plotted because they are too
large.) All inconsistent flows have $k_{ov} = 1$, $f_{inc} = 0.1$, and $m = 10$.

Without loss of generality, we assume $p_a = 1$, i.e., once a flow is selected, it remains active for
at least $T_{mon}$. Note that if $p_a < 1$, this will simply result in scaling up $n_{inc}$ by
$1/p_a$. Probabilities $p_{inc-inc}$ and $p_{c-inc}$ are computed based on the p.d.f. of the relative
discrepancy. Figure 7.6(a) then plots the probability to identify an inconsistent flow, $p_{id}$,
while Figure 7.6(b) plots the expected number of tests to classify a flow as inconsistent,
$n_{inc}$. These plots illustrate the trade-offs in choosing $H_u$. On one hand, the results in
Figure 7.6(a) suggest that we have to choose $H_u \geq 0.2$; otherwise, the probability to identify
an inconsistent flow becomes smaller than 0.5. On the other hand, from Figure 7.6(b), it follows that
in order to make the test responsive we have to choose $H_u \leq 0.2$, as $n_{inc}$ starts to
increase hyper-exponentially for $H_u > 0.2$. To meet these restrictions, we choose $H_u = 0.2$.
Note that this choice gives us the ability to catch any inconsistent flow as long as its
inconsistency factor ($k_{inc}$) is greater than 1.2.
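The trade-off in choosing the upper threshold can be explored with a small sweep over the bufferless model. This is a sketch under the assumptions stated in this section (consistent flows with overflow factor 1.3, inconsistent flows never dropped upstream so that their discrepancy is exactly k_inc - 1, 10% inconsistent traffic, m = 10, p_a = 1), plus the additional assumption that n = m k_ov packets arrive per monitoring interval. All function and parameter names are hypothetical.

```python
from fractions import Fraction
from math import comb


def p_c_inc(m: int, k_ov: Fraction, h_u: Fraction) -> float:
    """Pr[dis_rel > h_u] for a consistent flow in the bufferless model."""
    n = round(m * k_ov)      # assumed number of arrivals during T_mon
    p = float(1 / k_ov)      # forwarding probability at the upstream node
    terms = [
        comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
        if Fraction(x - m, m) > h_u
    ]
    return sum(terms, 0.0)


def evaluate_threshold(h_u, k_inc, k_ov=Fraction("1.3"), m=10, f_inc=0.1, p_a=1.0):
    """Return (p_id, n_inc) for one choice of the upper threshold h_u."""
    p_ci = p_c_inc(m, k_ov, h_u)
    p_ii = 1.0 if k_inc - 1 > h_u else 0.0   # inconsistent flow has dis_rel = k_inc - 1
    caught = f_inc * p_ii
    flagged = caught + (1.0 - f_inc) * p_ci
    p_id = caught / flagged if flagged > 0 else float("nan")
    n_inc = 1.0 / (p_a * flagged) if flagged > 0 else float("inf")
    return p_id, n_inc


for h in ("0.1", "0.2", "0.3"):
    print(h, evaluate_threshold(Fraction(h), Fraction("1.25")))
```

Exact fractions are used for the threshold comparisons because values such as 1.3 - 1 are not represented exactly in floating point, which would otherwise flip the outcome at the boundary.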

7.3.4  Increasing  Flow  Identification  Test’s  Robustness  and  Responsiveness 

In the previous section we have assumed that $m = 10$, where $m$ represents the number of packets
that can be sent during a monitoring interval at the fair rate by the upstream router. In this
section we study the impact of $m$ on the performance of the flow identification test.

Figures 7.7(a) and 7.7(b) plot the probability to identify an inconsistent flow, $p_{id}$, and the
expected number of tests it takes to classify a flow as inconsistent, $n_{inc}$, versus $k_{inc}$
for various values of $m$. As shown in Figure 7.7(a), increasing $m$ can dramatically improve the
test robustness. In addition, as shown in Figure 7.7(b), a large $m$ makes the identification test
more responsive for values of $k_{inc}$ closer to $H_u$. However, it is important to note that,
while a large $m$ can significantly reduce $n_{inc}$, this does not necessarily translate into a
reduction of the time it takes to classify a flow as inconsistent, i.e., $T_{inc}$. In particular,
if a router monitors $M$ flows simultaneously, we have $T_{inc} = T_{mon} n_{inc} / M$. Thus, an
increase of $m$ results directly in an increase of the time it takes to test a flow for consistency,
$T_{mon}$, an increase which can offset the decrease of $n_{inc}$.

Figure 7.7: (a) The probability to identify an inconsistent flow, $p_{id}$, and (b) the expected
number of tests to classify a flow as inconsistent, $n_{inc}$, versus the inconsistency factor,
$k_{inc}$, for various values of $m$.
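The interaction between m and the classification time can be made concrete with a few lines of arithmetic. The sketch below derives the monitoring interval from the definition m = alpha T_mon / l used earlier; the function name and the numerical values (packet size, fair rate, number of tests) are purely illustrative.

```python
def time_to_classify(m: int, pkt_len_bits: float, fair_rate_bps: float,
                     n_inc: float, n_monitored: int) -> float:
    """T_inc = T_mon * n_inc / M, with T_mon = m * l / alpha."""
    t_mon = m * pkt_len_bits / fair_rate_bps
    return t_mon * n_inc / n_monitored


# Hypothetical numbers: 1000-byte packets, 1 Mbps fair rate, 4 flows monitored at once.
small_m = time_to_classify(m=10, pkt_len_bits=8000, fair_rate_bps=1e6, n_inc=8, n_monitored=4)
large_m = time_to_classify(m=20, pkt_len_bits=8000, fair_rate_bps=1e6, n_inc=5, n_monitored=4)
```

In this made-up example the larger m needs fewer tests, yet the total time grows, because each test now lasts twice as long.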

7.4  Identifying  Misbehaving  Nodes 

Recall that our main goal is to detect misbehaving routers and/or hosts. In this section we present
and discuss such a test.

Like a flow identification test, ideally, a node identification test should be (1) robust, (2)
responsive, and (3) sensitive. Unfortunately, it is very hard, if not impossible, to achieve these
goals simultaneously. This is to be expected as the node test is based on the flow identification
test, and therefore we are confronted with the same difficult trade-offs. Worse yet, the fact that
the probability of a node misbehaving can be very low makes the problem even more difficult.

Arguably, the simplest node identification test would be to decide that an upstream node
misbehaves whenever a flow is classified as inconsistent. The key problem is that in a
well-engineered system, in which the probability of a misbehaving node is very small, this test is
not enough. As an example, assume that we use $m = 20$, and that our goal is to detect inconsistent
flows with $k_{inc} \geq 1.25$. Then, according to the results in Figure 7.7, we have
$p_{id} = 0.92$. Thus, there is a 0.08 probability of false positives. In addition, assume that the
probability that the upstream node $N_1$ misbehaves is also 0.08. This basically means that whenever
$N_2$ classifies a flow as being inconsistent there is only a 0.5 chance that this is because $N_1$
misbehaves!

To alleviate this problem we propose a slightly more complex node identification test: instead
of using only one observation to decide whether a node misbehaves or not, we use multiple
observations. In particular, we have the following test:

Node Identification (CSFQ): Test 1 An upstream node is assumed to misbehave if at least $n_f$
flows out of the last $N_f$ tested flows were classified as inconsistent, where $n_f$ and $N_f$ are
predefined constants.
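Test 1 amounts to a sliding-window counter over the outcomes of the most recent flow tests. The class below is a minimal sketch of that idea; the class and method names are hypothetical, and how the flow test outcomes are produced is outside its scope.

```python
from collections import deque


class NodeTest1:
    """Flag the upstream node when at least n_f of the last N_f tested flows were inconsistent."""

    def __init__(self, n_f: int, window_size: int) -> None:
        self.n_f = n_f
        # deque with maxlen drops the oldest outcome automatically.
        self.window = deque(maxlen=window_size)

    def record(self, flow_inconsistent: bool) -> bool:
        """Record one flow test outcome and return the current verdict for the node."""
        self.window.append(flow_inconsistent)
        return sum(self.window) >= self.n_f


detector = NodeTest1(n_f=3, window_size=5)
outcomes = [True, False, True, False, False, True, True, True]
verdicts = [detector.record(o) for o in outcomes]
```

Isolated inconsistent classifications age out of the window, so only a sustained run of them triggers the node-level verdict.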

Let  P  denote  the  probability  of  false  positives,  that  is,  the  probability  that  every  time  we  identify 
an  inconsistent  flow  during  the  last  Nf  tests  we  were  wrong.  Assuming  that  the  results  of  the  flow 
identification  tests  are  independently  distributed,  we  have: 

(7.8) 

In the context of the previous example, assume that $N_f = 5$ and $n_f = 3$. This yields
$P = 0.0002$. Thus, in this case, the probability that we are wrong is much smaller than the
probability that the node is misbehaving. In consequence, if three out of the last five tested flows
are classified as inconsistent, we can decide with high probability that the upstream node is indeed
misbehaving.

A potential problem with the previous test is that a large threshold $H_u$ will allow a misbehaving
node to inflict substantial damage without being caught. In particular, if a node inserts in the
packet headers a rate that is $k_{inc}$ times larger than the actual flow rate, where
$k_{inc} - 1 < H_u$, the node will get up to $k_{inc} - 1$ times more capacity for free without
being caught. For example, in our case, when $H_u = 0.2$, the upstream node can get up to 20% extra
bandwidth.

To filter out this attack we employ a second test. This test is based on the observation that in a
system in which all nodes are well-behaved the mean value of $dis_{rel}$ is expected to be zero. In
this case, any significant deviation of $dis_{rel}$ from zero is interpreted as being caused by a
misbehaving upstream node.

Node Identification (CSFQ): Test 2 An upstream node is assumed to misbehave if the mean value
of $dis_{rel}$ does not fall within $[-\delta, \delta]$.
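Test 2 can be sketched as a check on the sample mean of the measured relative discrepancies. The function name, the tolerance of 0.05, and the sample values are all hypothetical; how many samples to average is left open here.

```python
from statistics import fmean


def node_test2(dis_rel_samples: list, delta: float) -> bool:
    """Return True if the mean relative discrepancy falls outside [-delta, delta]."""
    return abs(fmean(dis_rel_samples)) > delta


# A well-behaved upstream node: discrepancies scatter around zero.
well_behaved = node_test2([0.1, -0.1, 0.05, -0.05], delta=0.05)
# A node that stays below a per-flow threshold of 0.2 but is biased on every flow.
biased = node_test2([0.15, 0.15, 0.15, 0.15], delta=0.05)
```

The second case illustrates the attack this test is meant to filter out: each flow individually passes a range test with an upper threshold of 0.2, yet the mean is clearly shifted away from zero.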

7.4.1  General  Properties 

In this section we present two simple but important properties of our identification test. The first
property says that a misbehaving node can inflict the maximum damage when it sends inconsistent
traffic at a constant rate. In particular, the excess bandwidth received by the inconsistent traffic
at a downstream router, before being caught, is maximized when the rate of the inconsistent traffic
is constant. This property is important because it allows us to limit our study to the cases in which
the fraction of the inconsistent traffic sent by a misbehaving node is constant. The second property
says that a misbehaving node cannot hurt the downstream consistent traffic in a "big" way for a long
time. In other words, the higher the rate of the inconsistent traffic is, the faster the misbehaving
node is caught. The two properties are given below.

Property 1 Let $R_{inc}(t', t'')$ denote the total volume of inconsistent traffic received by a
router at an input port during the interval $[t', t'')$. Then, the probability that no inconsistent
flow is identified during $[t', t'')$ is maximized when the inconsistent traffic arrives at a fixed
rate $r_{inc}$, where $r_{inc} = R_{inc}(t', t'')/(t'' - t')$.

Proof. Let $r_{inc}(t)$ be the rate of the inconsistent traffic at time $t$, and let $C$ denote the
capacity of the input link on which the traffic arrives. Then, the fraction of the inconsistent
traffic at time $t$ is $f_{inc}(t) \geq r_{inc}(t)/C$, where equality holds when the link is fully
utilized. Let $t_1, t_2, \ldots, t_n$ be the time instants when a new flow is selected to be
monitored during the interval $[t', t'')$. By using Eq. (7.6), the probability that none of these
flows will be identified as being inconsistent, denoted $q$, is

$$\begin{aligned}
q &= \prod_{i=1}^{n} \left( 1 - p_{inc}(t_i) \right) \\
&\leq \left( 1 - \frac{p_a}{n} \sum_{i=1}^{n} \left( f_{inc}(t_i) \times p_{inc-inc} + (1 - f_{inc}(t_i)) \times p_{c-inc} \right) \right)^{n},
\end{aligned} \tag{7.9}$$

where the inequality follows because the geometric mean of non-negative terms is no larger than
their arithmetic mean.

Since a router randomly selects a flow to be monitored, we assume without loss of generality that
$t_1, t_2, \ldots, t_n$ are independently distributed within $[t', t'')$. By assuming that
$r_{inc}(t)$ is a continuous function over the interval $[t', t'')$, we have

$$r_{inc} = \mathrm{Exp}\left( \frac{\sum_{i=1}^{n} r_{inc}(t_i)}{n} \right). \tag{7.10}$$

In addition, since $f_{inc}(t) \geq r_{inc}(t)/C$ for any $t$, the following inequality follows
trivially:

$$\frac{\sum_{i=1}^{n} f_{inc}(t_i)}{n} \geq \frac{\sum_{i=1}^{n} r_{inc}(t_i)}{nC}. \tag{7.11}$$

By combining Eqs. (7.9), (7.11), and (7.10), using the fact that $f_{inc}$, $p_a$, $p_{inc-inc}$,
and $p_{c-inc}$ are independently distributed variables, and the fact that
$p_{inc-inc} \geq p_{c-inc}$, we obtain

$$\mathrm{Exp}(q) \leq \left( 1 - \mathrm{Exp}(p_a) \times f_{inc} \times \mathrm{Exp}(p_{inc-inc}) - \mathrm{Exp}(p_a) \times (1 - f_{inc}) \times \mathrm{Exp}(p_{c-inc}) \right)^{n}, \tag{7.12}$$

where $f_{inc} = r_{inc}/C$. But the last term in the above inequality represents exactly the
expected probability that a flow will not be classified as inconsistent after $n$ tests, when the
inconsistent traffic arrives at a fixed rate. This concludes the proof of the property. □

Thus, if the fraction at which a misbehaving node sends inconsistent traffic fluctuates, the
probability to identify an inconsistent flow can only increase. This is exactly the reason we have
considered in this chapter only the case in which this fraction, i.e., $f_{inc}$, is constant.

Property  2  The  higher  the  rate  of  the  inconsistent  traffic  is,  the  higher  the  probability  to  identify 
an  inconsistent  flow  is. 

Proof. The proof follows immediately from Eq. (7.4) and the fact that in a well-designed flow
identification test, the probability that an inconsistent flow will be identified, $p_{inc-inc}$,
should be larger than the probability that a consistent flow will be misidentified, $p_{c-inc}$. □

7.5 Simulation Results


In this section we evaluate the accuracy of the flow identification test by simulation. All
simulations were performed in ns-2 [78], and are based on the original CSFQ code available at
http://www.cs.cmu.edu/~istoica/csfq.

Unless otherwise specified, we use the same parameters as in Section 4.4. In particular, each
output link has a buffer of 64 KB, and the buffer threshold for CSFQ is set to 16 KB. The averaging
constant to estimate the flow rate is $K = 100$ ms, and the averaging constant to estimate the fair
rate is $K_\alpha = 200$ ms. In all topologies we assume that the first router traversed by each
flow estimates the rate of the flow and inserts it in the packet headers.

Each router can verify up to four flows simultaneously. Each flow is verified for at least 200
ms or 10 consecutive packets. Core routers use the same algorithm as edge routers to estimate the
flow rate. However, to improve the convergence of the estimation algorithm, the rate is initialized
to the label of the first packet of the flow that arrives during the estimation interval. As
discussed in Section 7.3.3, we set the upper threshold to $H_u = 0.2$. Since we will test only for
downward-inconsistent flows (the only type of inconsistent flows that can steal service from
consistent flows), we will not use the lower threshold $H_l$.

In general, we find that our flow identification test is robust and that our results are quite
comparable to the results obtained in the case of the bufferless system model.

7.5.1  Calibration 

As discussed in Section 7.3.2, to design the node identification test, i.e., to set parameters
$n_f$ and $N_f$, it is enough to know (1) the probability to identify an inconsistent flow,
$p_{id}$, and (2) the expected number of tests it takes to classify a flow as inconsistent,
$n_{inc}$.

So far, in designing our identification tests, we have assumed a bufferless system, which ignores
the inaccuracies introduced by (1) the delay jitter, and (2) the rate estimation algorithm. In this
section we simulate a more realistic scenario by using the ns-2 simulator [78]. We consider a
simple link traversed by 30 flows, out of which three are inconsistent and 27 are consistent. Note
that this corresponds to $f_{inc} = 0.1$. Our goal is twofold. First, we want to show that the
results achieved by simulations are reasonably close to the ones obtained by using the bufferless
system. This basically says that the bufferless system can be used as a reasonable first
approximation to design the identification tests. Second, we use the results obtained in this
section to set parameters $N_f$ and $n_f$ for the node identification test.

We perform two experiments. In the first experiment, we assume that all flows are constant bit
rate UDPs. Figure 7.8(a) plots the probability $p_{id}$, while Figure 7.8(b) plots the expected
number of flows that are tested before a flow is classified as inconsistent. As expected, when
$k_{ov} \leq 1.2$, $p_{id} = 1$. This is because even in the worst-case scenario, when all packets
of a consistent flow are forwarded, the flow's relative discrepancy will be no larger than
$H_u = 0.2$. However, as $k_{ov}$ exceeds 1.2, the probability $p_{id}$ reduces significantly, as we
are more and more likely to classify a consistent flow as inconsistent. In addition, $p_{id}$ is
strongly influenced by $k_{inc}$. The larger $k_{inc}$ is, the larger $p_{id}$ is. This is because
when $k_{inc}$ increases we are more and more likely to catch the
inconsistent flows. Similarly, as shown in Figure 7.8(b), $n_{inc}$ is decreasing in $k_{inc}$ and
increasing in $k_{ov}$.

Figure 7.8: (a) The probability to identify an inconsistent flow, $p_{id}$, and (b) the expected
number of tests it takes to classify a flow as inconsistent, $n_{inc}$. We consider 30 flows, out of
which three are inconsistent.

Table 7.2: $p_{id}$ and $n_{inc}$ as a function of the parameter $k_{inc}$ of the inconsistent
flows. We consider 27 consistent TCP flows and 3 UDP inconsistent flows.

One thing worth noting here is that these results are consistent with the ones achieved in the
simplified bufferless system. For example, as shown in Figures 7.6(a) and 7.8(a), in both cases the
probability $p_{id}$ for $k_{ov} = 1.3$, $k_{inc} = 1.25$, and $H_u = 0.2$ is somewhere between 0.7
and 0.8. This suggests that tuning the parameters in the bufferless system represents a reasonable
first-order approximation.

In the second experiment we assume that all the consistent flows are TCPs, while the inconsistent
flows are again constant bit rate UDPs. The results are presented in Table 7.2. It is interesting to
note that although the TCP flows experience a dropping rate of only 5%, which would correspond
to $k_{ov} = 1.05$, both $p_{id}$ and $n_{inc}$ are significantly worse than their corresponding
values in the case of the previous experiment when $k_{ov} = 1.05$. In fact, comparing the values
which correspond to $k_{inc} = 1.3$ in both Figure 7.8(a) and Table 7.2, the behavior of TCPs is
much closer to a scenario in which they are replaced by UDPs with $k_{ov} = 1.3$, rather than
$k_{ov} = 1.05$. This is mainly because the TCP burstiness negatively affects the rate estimation
accuracy, and ultimately the accuracy of our identification test.

In summary, for both TCP and UDP flows we take $p_{id} = 0.76$ and $n_{inc} = 25$. This means that
by using 25 observations, the probability to identify an inconsistent flow with
$k_{inc} \geq 1.3$ is greater than or equal to 0.76.

Figure 7.9: (a) Topology used to illustrate the protection and recovery aspects of our scheme. (b)
The aggregate throughput of all flows from router 1 to router 5, and of all flows from router 2 to
router 5, respectively. All consistent flows are UDPs with $k_{ov} = 1.3$. The inconsistent flow is
UDP and has $k_{inc} = k_{ov} = 1.3$.


Finally, for the node identification test we choose $n_f = 2$ and $N_f = 25$. It can be shown that
this choice increases the probability to correctly identify an inconsistent node to over 0.9.

7.5.2  Protection  and  Recovery 

In this section we illustrate the behavior of our scheme during the protection and recovery phases.
For this we consider a slightly more complex topology consisting of five routers (see Figure 7.9(a)).
There are ten flows from router 1 to router 5, and another ten flows from router 2 to router 5. We
assume that router 1 misbehaves by forwarding an inconsistent flow with $k_{inc} > 1$. Again, note
that in this case the fraction of the inconsistent traffic is $f_{inc} \approx 0.1$. Finally, we
assume that router 1's misbehavior is transitory, i.e., it starts to misbehave at time $t = 33$ sec,
and ceases to misbehave at time $t = 66$ sec. The entire simulation time is 100 sec.

In the first experiment we assume that all consistent flows have $k_{ov} = 1.3$, and that the
inconsistent flow has $k_{inc} = 1.3$ and $k_{ov} = 1.3$. Figure 7.9(b) plots the aggregate
throughputs of all flows from router 1 to router 5, and from router 2 to router 5, respectively.
Note that during the interval (33 sec, 66 sec), i.e., during the time when router 1 misbehaves, the
flows from router 2 to router 5 get slightly higher bandwidth. This illustrates the fact that even
for $k_{inc} = 1.3$ (which is the minimum $k_{inc}$ we have considered in designing the node
identification test), our algorithm is successful in identifying the node misbehavior, and in
protecting the well-behaved traffic that flows from router 2 to router 5. In addition, once router 1
ceases to misbehave at $t = 66$ sec, our recovery algorithm recognizes this and stops punishing the
traffic forwarded by router 1.

In Figures 7.10(a) and (b) we plot the results for a virtually identical scenario both in the case
when all routers employ the unmodified CSFQ, and in the case when the routers employ the
"verify-and-protect" version of CSFQ. The only difference is that the inconsistent flow now has
$k_{inc} = 2$, instead of 1.3. The reason for the increase of $k_{inc}$ is to make it easier to
observe the amount of bandwidth that is stolen by the inconsistent flow from the consistent flows
when the unmodified CSFQ is used. This can be seen in Figure 7.10(a), as the aggregate throughput of
all flows from router 2 to router 5 is slightly less than 5 Mbps. In contrast, when the
"verify-and-protect" version is used, these flows get considerably more than 5 Mbps. This is because
once router 3 concludes that router 1 misbehaves, it punishes all its flows by simply multiplying
their labels by $k_{inc} = 2$. As a result, the traffic traversing router 1 gets only 33%, instead
of 50%, of the link capacities at the subsequent routers.

Figure 7.10: Aggregate throughputs of all flows from router 1 to router 5, and from router 2 to
router 5, when all routers implement (a) the unmodified CSFQ, and (b) the "verify-and-protect"
version of CSFQ. All consistent flows are UDPs with $k_{ov} = 1.3$. The inconsistent flow is UDP
with $k_{inc} = 2$ and $k_{ov} = 1$.
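The 33% figure can be reproduced with a back-of-the-envelope calculation. The sketch below rests on an assumption that is not spelled out in the text: that all flows are backlogged and that multiplying a flow's label by a penalty factor k acts like giving that flow a weight of 1/k relative to unpunished flows. The function name is hypothetical.

```python
from fractions import Fraction


def punished_share(penalty: int, n_punished: int, n_other: int) -> Fraction:
    """Capacity share of the punished aggregate, treating a label penalty k as weight 1/k."""
    weight = Fraction(1, penalty)
    return (n_punished * weight) / (n_punished * weight + n_other)


# Ten flows through the misbehaving router and ten through the well-behaved one.
without_penalty = punished_share(penalty=1, n_punished=10, n_other=10)
with_penalty = punished_share(penalty=2, n_punished=10, n_other=10)
```

Under this simplified model the two aggregates split the link evenly without a penalty, and the punished aggregate drops to one third of the capacity when its labels are doubled.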

Finally,  Figures  7.11  (a)  and  7.11(b)  plot  the  results  for  the  case  in  which  consistent  flows  are 
replaced  by  TCPs.  As  expected,  the  results  are  similar:  the  “verify-and-protect”  version  of  CSFQ  is 
able  to  successfully  protect  and  recover  after  router  1  stops  misbehaving. 


7.6  Summary 

While  the  solutions  based  on  SCORE/DPS  have  many  advantages  over  traditional  stateful  solutions 
(see  Chapter  4,  5,  6),  they  still  suffer  from  robustness  and  scalability  limitations  when  compared  to 
the  stateless  solutions.  In  particular,  the  scalability  is  hampered  because  the  network  core  cannot 
transcend  trust  boundaiies  (such  as  the  ISP-ISP  boundaries),  and  therefore  high-speed  routers  on 
these  boundaries  must  be  stateful  edge  routers.  The  lack  of  robustness  is  because  the  malfunctioning 
of  a  single  edge  or  core  router  could  severely  impact  the  performance  of  the  entire  network,  by 
inserting  inconsistent  state  in  the  packet  headers. 

In this chapter, we have proposed an approach to overcome these limitations. To achieve scalability,
we push the complexity all the way to the end-hosts. To address the trust and robustness
issues, all routers statistically verify whether the incoming packets carry consistent state. We call
this approach “verify-and-protect”. This approach enables routers to discover and isolate misbehaving
end-hosts and routers. The key technique needed to implement this approach is flow verification,
which allows the identification of packets carrying inconsistent state. To illustrate this approach, we
have described the identification algorithms in the case of Core-Stateless Fair Queueing (CSFQ),
and we have presented simulation results in ns-2 [78] to demonstrate its effectiveness.

Figure 7.11: Aggregate throughputs of all flows from router 1 to router 5, and from router 2 to router 5,
when all routers implement (a) the unmodified CSFQ, and (b) the “verify-and-protect” version of CSFQ. All
consistent  flows  are  TCPs.  The  inconsistent  flow  is  UDP  with  =  2,  and  /i  ,,,,  ~  1. 


A final observation is that the “verify-and-protect” approach can also be useful in the context
of stateful solutions. Based only on the traffic a router sees, these solutions are not usually able to
tell whether there is a misbehaving up-stream router or not. For example, with Fair Queueing there
is no way for a router to discover an up-stream router that spuriously drops packets, as it cannot
differentiate between a flow whose packets were dropped by a misbehaving router and a flow that
just sends fewer packets. Thus, the “verify-and-protect” approach can also be used to effectively
increase the robustness of the traditional stateful solutions.

Chapter 8


Prototype  Implementation  Description 


In  Chapters  4, 5  and  6  we  have  presented  three  applications  of  our  SCORE/DPS  solution  to  provide 
scalable  services  in  a  network  in  which  core  routers  maintain  no  per  flow  state.  In  this  chapter, 
we  describe  a  prototype  implementation  of  our  solution.  In  particular,  we  have  fully  implemented 
the functionalities required to provide guaranteed and flow protection services. While Section 5.5
presents several experimental results, including overhead measurements of our prototype, this chapter
focuses on the implementation details.

The main goal of the prototype implementation is to show that it is possible to efficiently deploy
our algorithms in today’s IPv4 networks with minimum incompatibility. To prove this point
we have implemented a complete system, including support for fine grained monitoring, and easy
configuration. The current prototype is implemented in FreeBSD v2.2.6, and it is deployed in a test-bed
consisting of 300 MHz and 400 MHz Pentium II PCs connected by point-to-point 100 Mbps
Ethernets. The test-bed allows the configuration of a path with up to two core routers.

Although  we  had  complete  control  of  our  test-bed,  and,  due  to  resource  constraints,  the  scale 
of  our  experiments  was  rather  small  (e.g.,  the  largest  experiment  involved  just  100  flows),  we  have 
devoted  special  attention  to  making  our  implementation  as  general  as  possible.  For  example,  while 
in  the  current  implementation  we  re-use  protocol  space  in  the  IP  header  to  store  the  DPS  state,  we 
make sure that the modified fields can be fully restored by the egress router. In this way, the changes
made by the ingress and core routers to the packet header are completely transparent to the
outside world. Similarly, while the limited scale of our experiments would have allowed us to use
simple  data  structures  to  implement  our  algorithms,  we  try  to  make  sure  that  our  implementation 
is  scalable.  For  example,  instead  of  using  a  simple  linked  list  to  implement  the  CJVC  scheduler, 
we  use  a  calendar  queue  together  with  a  two-level  priority  queue  to  efficiently  handle  a  very  large 
number  of  flows  (see  Section  8.1). 

For debugging and management purposes, we have implemented full support for packet level
monitoring. This allows us to visualize simultaneously and in real-time the throughputs and the
delays experienced by flows at different points in the network. A key challenge when implementing
such a fine grained monitoring functionality is to minimize the interference with the system operations.
We use two techniques to address this challenge. First, we off-load as much as possible of
the processing of log data to an external machine. Second, we use raw IP to send the log data from
the router’s kernel directly to the external machine. This way, we avoid context-switching between the
kernel and the user level.

To easily configure our system, we have implemented a command line configuration tool. This
tool allows us to (1) configure routers as ingress, egress, or core, (2) set up, modify, and tear down
a reservation, and (3) set up the monitoring parameters. To minimize the interference between
the configuration operations and data processing, we implement our tool on top of the Internet
Control Message Protocol (ICMP). Again, by using ICMP, we avoid context-switching when
configuring a router.

The  rest  of  the  chapter  is  organized  as  follows.  The  next  section  presents  details  of  the  main 
operations  implemented  by  the  data  and  control  paths.  Section  8.2  describes  encoding  techniques 
used  to  efficiently  store  the  DPS  state  in  the  packet  headers.  Section  8.3  presents  our  packet  level 
monitoring  tool,  while  Section  8.4  describes  the  configuration  tool.  Finally,  Section  8.5  concludes 
this  chapter. 


8.1  Prototype  Implementation 

For all practical purposes, a FreeBSD PC implements an output queueing router. Output interfaces
usually employ a FIFO scheduling discipline and drop-tail buffer management. Upon a packet
arrival, the router performs a routing table lookup, based on the packet destination address, to determine
to which output interface the packet should be forwarded. If the output link is idle, the packet
is passed directly to the device driver of that interface. Otherwise, the packet is inserted at the tail
of the packet queue associated with the interface, by using the IF_ENQUEUE macro.

When the output link becomes idle, or when the network card can accept more packets, a hardware
interrupt is generated. This interrupt is processed by the device driver associated with the network
card. As a result, the packet at the head of the queue is dequeued (by using the IF_DEQUEUE
macro) and sent to the network card.

Our implementation is modular in the sense that it requires few changes to the FreeBSD legacy
code. The major change is to replace the queue manipulation operations implemented by IF_ENQUEUE
and IF_DEQUEUE. In particular, we replace

    IF_ENQUEUE(&ifp->if_snd, mb_head);


with 


    if (ifp->node_type)
        dpsEnqueue(ifp, (void **)&mb_head);
    else
        IF_ENQUEUE(&ifp->if_snd, mb_head);

In other words, we first check whether the router is configured as a DPS router or not, and if yes,
we call dpsEnqueue to enqueue the packet¹ in a special data structure maintained by DPS. If not,
we simply use the default IF_ENQUEUE macro. Similarly, we replace

    IF_DEQUEUE(&ifp->if_snd, mb_head);

with

    if (ifp->node_type)
        dpsDequeue(ifp, &mb_head);
    else
        IF_DEQUEUE(&ifp->if_snd, mb_head);

¹ Internally, FreeBSD stores packets in a special data structure called mbuf. An ordinary mbuf can store up to 108
bytes of data. If a packet is larger than 108 bytes, a chain of mbufs is used to store the packet. mb_head represents the
pointer to the first mbuf in the chain.

In the current implementation, we support two device drivers: the Intel EtherExpress Pro/100B
PCI Fast Ethernet driver, and the DEC 21040 PCI Ethernet Controller. As a result, the above changes
are limited to three files: if_ethersubr.c, which implements the Ethernet device-dependent functionality,
and if_de.c and if_fxp.c, which implement the device-dependent functionalities for the two drivers
we support.

In the remainder of this section, we present implementation details for the operations performed
by our solution on both the data and control paths. All the processing is encapsulated by the dpsEnqueue
and dpsDequeue functions. The implementation is about 5,000 lines.

8.1.1  Updating  State  in  IP  Header 

Before  forwarding  a  packet,  a  core  router  may  update  the  information  carried  in  its  header.  For 
example, in the case of Core-Stateless Fair Queueing (CSFQ), if the estimated rate carried by a packet is
larger  than  the  link’s  fair  rate,  the  router  has  to  update  the  state  carried  by  the  packet  (see  Figure  4.3). 
Similarly,  with  Core-Jitter  Virtual  Clock  (CJVC),  a  router  has  to  insert  the  “ahead  of  schedule” 
variable,  g,  before  forwarding  the  packet  (see  Figure  5.3). 

Updating the state involves two operations: encoding the state, and updating the IP checksum.
State encoding is discussed in detail in Section 8.2. The checksum is computed as the 1's complement
sum over the IP header, and therefore it can be easily updated in an incremental fashion.
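
The incremental update can be sketched as follows. This is a minimal illustration and not the prototype's code: it applies the standard identity HC' = ~(~HC + ~m + m') from RFC 1624 when a single 16-bit header word changes from m to m'. The function names checksum_full and checksum_update are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Full 16-bit one's complement checksum over a header given as 16-bit
 * words. The checksum word itself must be zero when this is called. */
static uint16_t checksum_full(const uint16_t *words, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incremental update after one 16-bit word changes from old_w to new_w:
 * HC' = ~(~HC + ~m + m'), computed in one's complement arithmetic. */
static uint16_t checksum_update(uint16_t hc, uint16_t old_w, uint16_t new_w) {
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~old_w;
    sum += new_w;
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A router that rewrites only the word holding the DPS state therefore touches a constant number of words per packet, instead of re-summing the whole header.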

8.1.2  Data  Path 

In  this  section  we  present  the  implementation  details  of  the  main  operations  performed  by  ingress 
and  core  routers  on  the  data  path:  packet  classification,  buffer  management,  and  packet  scheduling. 
Since  the  goal  of  our  solution  is  to  eliminate  the  per  flow  state  from  core  routers,  we  will  concentrate 
on  operation  complexity  at  these  routers.  (The  exception  is  packet  classification  which  is  performed 
by  ingress  routers  only.) 

8.1.2.1  Packet  Classification 

Recall  that  in  SCORE/DPS  only  ingress  routers  need  to  perform  packet  classification.  Core  routers 
do  not,  as  they  process  packets  based  only  on  the  information  carried  in  the  packet  headers. 

Our current prototype offers only limited classification capabilities.² In particular, a class is
defined by fully specifying the source and destination IP addresses, the source and destination port
numbers,  and  the  protocol  type.  This  allows  us  to  implement  packet  classification  by  using  a  simple 
hash  table  [26],  in  which  the  keys  are  computed  by  xor-ing  all  fields  in  the  IP  header  that  are  used 
for  classification. 
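
A minimal sketch of such a hash key computation is shown below. The structure flow_key, the function names, and the table size CLASS_TABLE_SIZE are hypothetical; the prototype's actual layout and table size are not given here.

```c
#include <stdint.h>

#define CLASS_TABLE_SIZE 1024   /* hypothetical; must be a power of two */

/* The five fully specified fields that define a class. */
struct flow_key {
    uint32_t src_addr, dst_addr;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* XOR all classification fields together and reduce to a bucket index. */
static unsigned flow_hash(const struct flow_key *k) {
    uint32_t h = k->src_addr ^ k->dst_addr;
    h ^= ((uint32_t)k->src_port << 16) | k->dst_port;
    h ^= k->proto;
    h ^= h >> 16;               /* fold the high half into the low half */
    return h & (CLASS_TABLE_SIZE - 1);
}

/* Exact match used to resolve collisions inside a bucket. */
static int flow_key_equal(const struct flow_key *a, const struct flow_key *b) {
    return a->src_addr == b->src_addr && a->dst_addr == b->dst_addr &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}
```

Because every field is fully specified, a lookup is one hash computation followed by exact comparisons within a single bucket.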

² This limitation can be easily addressed by replacing the current classifier with the more complete packet classifier developed in the
context  of  the  Darwin  project  at  Carnegie  Mellon  University  [19]. 

Figure 8.1: Data structures used to implement CJVC. The rate-regulator is implemented by a calendar queue,
while the scheduler is implemented by a two-level priority queue. Each node at the second level (and each
node in the calendar queue) represents a packet. The first number represents the packet's eligible time; the
second its deadline. (a) and (b) show the data structures before and after the system time advances from 5 to
6. Note that all packets that become eligible are moved in one operation to the scheduler data structure.

8.1.2.2 Buffer Management

With both CJVC and CSFQ, routers do not need to maintain any per flow state to perform buffer
management. In particular, with CSFQ, packets are dropped probabilistically based on the flow
estimated rate carried by the packet header, and the link fair rate (see Figures 4.3 and 4.4). In the case
of CJVC, we simply use a single drop-tail queue for all the guaranteed traffic. This is made possible
by the fact that, as we have discussed in Section 5.3.3, a relatively small queue can guarantee with
a very high probability that no packets are ever dropped. For example, for all practical purposes, an
8,000-packet queue can handle up to one million flows.
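
The stateless drop decision for CSFQ can be illustrated with the short sketch below. It assumes the usual CSFQ rule, in which a packet labeled with rate r is dropped with probability max(0, 1 - alpha/r) for a fair rate alpha, and a forwarded packet is relabeled with min(r, alpha); Figures 4.3 and 4.4 remain the authoritative description. The function names are hypothetical, and the uniform random number u is supplied by the caller so that the logic stays deterministic.

```c
/* Probability of dropping a packet whose header carries the flow's
 * estimated rate `label`, given the link's fair rate. */
static double csfq_drop_prob(double label, double fair_rate) {
    if (label <= fair_rate)
        return 0.0;
    return 1.0 - fair_rate / label;
}

/* Returns 1 if the packet is forwarded (and relabels it if needed),
 * 0 if it is dropped. `u` is a uniform random number in [0, 1). */
static int csfq_forward(double *label, double fair_rate, double u) {
    if (u < csfq_drop_prob(*label, fair_rate))
        return 0;                   /* dropped; no per flow state touched */
    if (*label > fair_rate)
        *label = fair_rate;         /* update the state carried by the packet */
    return 1;
}
```

Note that both functions use only the packet's label and a per-link value, which is why no per flow state is needed for buffer management.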

8.1.2.3 Packet Scheduling

With CSFQ, the scheduling is trivial as the packets are transmitted on a First-In-First-Out basis. In
contrast, CJVC is significantly more complex to implement. With CJVC, each packet is assigned
an eligible time and a deadline upon its arrival. A packet becomes eligible when the system time
exceeds the packet’s eligible time. The eligible packets are then transmitted in the increasing order
of their deadlines. Similar to other rate-controlled algorithms [112, 126], we implement CJVC using
a combination of a calendar queue and a priority queue [10].

The calendar queue performs rate-regulation, i.e., it stores packets until they become eligible.
The calendar queue is organized as a circular queue in which every entry corresponds to a time
interval [15]. Each entry stores the list of all packets that have their eligible times in the range
assigned to that entry. When the system time advances past an entry, all packets in that entry
become eligible, and they are moved into a priority queue. The priority queue maintains all eligible
packets in increasing order of their deadlines. Scheduling a new packet then reduces to the selection
of the packet at the head of the priority queue, i.e., the packet with the smallest deadline.

The algorithmic complexity of the calendar queue depends on how packets are stored within an
entry. If they are maintained in a linked list, then the insertion and deletion of a packet are constant
time operations. In contrast, insertion and deletion of a packet in the priority queue takes O(log n),
where n is the number of packets stored in the priority queue [26]. Thus the total cost per packet
manipulation is O(log n).

While the above implementation is quite straightforward, it has a significant drawback. As
described above, when the system time advances, all packets in an entry become simultaneously
eligible, and they have to be transferred into the scheduler priority queue. The problem is that, at
least theoretically, the number of packets that become eligible can be unbounded. Moreover, since
the packets are maintained in a singly linked list, we need to move them one by one. Thus, if
we have to move m packets, this will take O(m log n). Clearly, in a large system this solution is
unacceptable, as it may result in packets missing their deadlines due to the large processing time
required to move all packets from a calendar queue entry.

To address this problem, we store all packets that belong to the same calendar queue entry in
a priority queue ordered by the packets’ deadlines. This way, when these packets become eligible,
we can move all of them into the scheduler data structure in one step. To achieve this, we change the
scheduler data structure to a two-level priority queue. Figure 8.1 illustrates the new data structures
in the context of a movement operation. Thus, with the new data structures, we reduce the time it
takes to move all packets from a calendar queue entry into the scheduler data structure to O(log n).
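
The two-level organization can be sketched as follows. This is an illustrative simplification under stated assumptions, not the prototype's code: the first level is a binary min-heap of groups keyed by each group's smallest deadline, and each group (the packets of one calendar queue entry) is kept as an array already sorted by deadline, which stands in for the per-entry priority queue. All names and the capacity MAX_GROUPS are hypothetical.

```c
#define MAX_GROUPS 64               /* hypothetical first-level capacity */

struct pkt { int eligible; int deadline; };

/* Second level: packets of one calendar entry, sorted by deadline. */
struct group { struct pkt *pkts; int head, count; };

/* First level: min-heap of groups keyed by their head deadline. */
struct sched { struct group *heap[MAX_GROUPS]; int size; };

static int group_key(const struct group *g) { return g->pkts[g->head].deadline; }

static void swap_groups(struct sched *s, int i, int j) {
    struct group *t = s->heap[i]; s->heap[i] = s->heap[j]; s->heap[j] = t;
}

static void sift_up(struct sched *s, int i) {
    while (i > 0) {
        int p = (i - 1) / 2;
        if (group_key(s->heap[p]) <= group_key(s->heap[i])) break;
        swap_groups(s, p, i);
        i = p;
    }
}

static void sift_down(struct sched *s, int i) {
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < s->size && group_key(s->heap[l]) < group_key(s->heap[m])) m = l;
        if (r < s->size && group_key(s->heap[r]) < group_key(s->heap[m])) m = r;
        if (m == i) break;
        swap_groups(s, m, i);
        i = m;
    }
}

/* Move a whole calendar entry into the scheduler with one heap insertion,
 * regardless of how many packets the entry holds. */
static int sched_add_group(struct sched *s, struct group *g) {
    if (g->head >= g->count || s->size == MAX_GROUPS) return -1;
    s->heap[s->size++] = g;
    sift_up(s, s->size - 1);
    return 0;
}

/* Remove the eligible packet with the smallest deadline. */
static int sched_pop(struct sched *s, struct pkt *out) {
    if (s->size == 0) return -1;
    struct group *g = s->heap[0];
    *out = g->pkts[g->head++];
    if (g->head == g->count)        /* group exhausted: drop it from the heap */
        s->heap[0] = s->heap[--s->size];
    if (s->size > 0) sift_down(s, 0);
    return 0;
}
```

In this sketch, sched_add_group costs a single sift-up in the first-level heap, which corresponds to the one-step movement described above; the per-packet cost is paid later, one packet at a time, in sched_pop.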

A  final  point  worth  noting  is  that  the  number  of  packets,  n,  that  are  stored  at  any  time  in  these 
data  structures  is,  with  a  very  high  probability,  much  smaller  than  the  total  number  of  active  flows. 
As  discussed  in  Section  5.3.3,  this  is  an  artifact  of  the  fact  that  flows  are  aggressively  regulated  at 
each  node  in  the  SCORE  domain. 

8.1.3  Control  Path 

Providing per flow isolation does not require any operations on the control path. In contrast,
providing guaranteed services requires a signaling mechanism to perform per flow admission control.
However,  as  described  in  Section  5.4,  the  implementation  of  admission  control  at  the  core  routers  is 
quite  straightforward.  For  each  output  link,  a  router  maintains  a  value  B  that  is  updated  every  time 
a  new  packet  arrives.  Based  on  this  value,  the  router  periodically  updates  the  upper  bound  of  the 
aggregate  reservation  Rbound  (see  algorithm  in  Figure  5.6).  Finally,  to  decide  whether  to  accept  a 
new  reservation  or  not,  a  core  router  just  needs  to  perform  a  few  arithmetic  operations  (again,  see 
Figure  5.6). 

8.2 Carrying State in Data Packets


In order to eliminate the need for maintaining per flow state at each router, our DPS based algorithms
require packets to carry state in their headers. Since there is limited space in protocol headers and
most header bits have been allocated, the main challenge to implementing these algorithms is to
(a) find space in the packet header for storing DPS variables and at the same time remain fully
compatible with current standards and protocols; and (b) efficiently encode state variables so that
they fit in the available space without introducing too much inaccuracy. In the remainder of the
section, we present our approach to address the above two problems in IPv4 networks.

There are at least three alternatives to encode the DPS state in the packet headers:

1. Use a new IP option. (Note that DPS has been already assigned IP option number 23 by the
Internet Assigned Numbers Authority (IANA) [56].)

From the protocol point of view, this is arguably the least intrusive approach, as it requires
no changes to IPv4 or IPv6 protocol stacks. The downside is that using the IP option can add
a significant overhead. This can happen in two ways. First, most of today's IPv4 routers are
very inefficient in processing the IP options [109]. Second, the IP option increases the packet
length, which can cause packet fragmentation.

2.  Introduce  a  new  header  between  the  link  layer  and  the  network  layer,  similar  to  the  way  labels 
are  transported  in  Multi-Protocol  Label  Switching  (MPLS)  [17]. 

Like the previous approach, this approach does not require any changes in the IPv4/IPv6
protocol stack. However, since each router has to know about the existence of the extra header
in order to correctly process the IP packets, this approach requires changes in all routers,
regardless of whether this is needed to correctly implement the DPS algorithms. In contrast,
with the IP option approach, if a router does not understand a new IP option, it will simply
ignore it. In practice, this can be an important distinction, as many of today's core routers are
typically uncongested. Thus, if we were to implement a service like flow protection, with the
IP option approach we do not need to touch these routers, while with this approach we need to
change all of them.

An additional problem is that this approach requires us to devise different solutions for different
link layer technologies. Finally, note that it also suffers from a fragmentation problem,
since the addition of the extra header will increase the size of the packet.

Another  option  to  implement  this  approach  would  be  to  leverage  MPLS,  whenever  possible. 
In  particular,  in  a  network  that  implements  MPLS,  a  solution  would  be  to  use  an  extra  MPLS 
label  to  encode  the  DPS  state. 

3. Insert the DPS state in the IP header. The main advantage of this approach is that it avoids both the
penalty imposed by most IPv4 routers in processing the IP options and the need to devise
different solutions for different link layer technologies, as would be required when introducing
a new header between the link and network layers. The main problem, however, is finding
enough space to insert the extra information.

While the first two approaches are quite general and can potentially provide large space for
encoding state variables, for reasons of performance and ease of implementation we choose the
third approach in our current prototype.³

8.2.1  Carrying  State  in  IP  Header 

The main challenge to carrying state in the IP header is to find enough space to insert this information
in the header while remaining compatible with current standards and protocols. In particular,
we want the network domain to be transparent to end-to-end protocols, i.e., the egress node should
restore the fields changed by ingress and core nodes to their original values. To achieve this goal, we
first use four bits from the type of service (TOS) byte (now renamed the Differentiated Service (DS)
field) which are specifically allocated for local and experimental use [75]. In addition, we observe
that the 13-bit ip_off field in the IPv4 header, which supports packet fragmentation/reassembly,
is rarely used. For example, by analyzing the traces of over 1.7 million packets on an OC-3
link [77], we found that less than 0.22% of all packets were fragments.

Therefore, in most cases it is possible to use the ip_off field to encode the DPS values. This idea
can be implemented as follows. When a packet arrives at an ingress node, the node checks whether
the packet is a fragment or needs to be fragmented. If neither is true, the ip_off field in the
packet header is used to encode DPS values, and when the packet reaches the egress node, the
ip_off field is cleared. Otherwise, if the packet is a fragment, it is forwarded as a best-effort packet. In
this way, the use of ip_off is transparent outside the domain. We believe that forwarding a fragment
as a best-effort packet is acceptable in practice, as end-points can easily avoid fragmentation by
using a path Maximum Transmission Unit (MTU) discovery mechanism. Also note that in the above solution
we implicitly assume that packets can be fragmented only by egress nodes.
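
The ingress and egress handling of the fragment offset bits can be sketched as below. This is a hedged illustration rather than the prototype's code. The masks mirror the BSD definitions of IP_MF and IP_OFFMASK for a 16-bit ip_off value in host byte order; the function names and the needs_frag argument are hypothetical, and how the egress recognizes DPS packets (for example through the DS field bits) is left to the caller.

```c
#include <stdint.h>

#define IP_MF      0x2000   /* "more fragments" flag */
#define IP_OFFMASK 0x1fff   /* 13-bit fragment offset */

/* A packet is a fragment if it has a nonzero offset or more fragments follow. */
static int is_fragment(uint16_t ip_off) {
    return (ip_off & (IP_MF | IP_OFFMASK)) != 0;
}

/* Ingress: store up to 13 bits of DPS state in the offset bits.
 * Returns 1 on success, or 0 if the packet is a fragment or needs
 * fragmentation and must be forwarded as best-effort instead. */
static int ingress_set_state(uint16_t *ip_off, uint16_t dps13, int needs_frag) {
    if (needs_frag || is_fragment(*ip_off))
        return 0;
    *ip_off = (uint16_t)((*ip_off & ~IP_OFFMASK) | (dps13 & IP_OFFMASK));
    return 1;
}

/* Egress: clear the offset bits of a DPS packet so that the field looks
 * untouched outside the domain. The flag bits are preserved. */
static void egress_clear_state(uint16_t *ip_off) {
    *ip_off &= (uint16_t)~IP_OFFMASK;
}
```

Since a packet that receives DPS state had a zero offset on entry, clearing the offset bits at the egress restores the original field value.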

In  summary,  we  have  up  to  17  bits  available  in  the  current  IPv4  header  to  encode  four  state 
variables  (see  Figure  8.4).  The  next  section  discusses  some  general  techniques  to  efficiently  encode 
the  DPS  state. 

8.2.2  Efficient  State  Encoding 

One  simple  solution  to  efficiently  encode  the  state  is  to  restrict  each  state  variable  to  only  a  small 
number  of  possible  values.  For  example  if  a  state  variable  is  limited  to  eight  values,  only  three  bits 
are  needed  to  represent  it.  While  this  can  be  a  reasonable  solution  in  practice,  in  our  implementation 
we  use  a  floating  point  like  representation  to  represent  a  wider  range  of  values.  To  further  optimize 
the  use  of  the  available  space  we  employ  two  additional  techniques.  First,  we  use  the  floating  point 
format  only  to  represent  the  largest  value,  and  then  represent  the  other  value(s)  as  a  fraction  of  the 
largest  value.  In  this  way  we  are  able  to  represent  a  much  larger  range  of  possible  values.  Second,  in 
the  case  in  which  there  are  states  which  are  not  required  to  be  simultaneously  encoded  in  the  same 
packet,  we  use  the  same  field  to  encode  them.  Next,  we  present  the  floating  point  like  format  used 
to  encode  large  values. 

Assume that a is the largest value carried by the packet, where a is a positive integer. To
represent a we use an m-bit mantissa and an n-bit exponent. Since a > 0, it is possible to gain an
extra bit for the mantissa. For this we consider two cases: (a) If $a \ge 2^m$, we represent a as the closest
value of the form $u \times 2^v$, where $2^m \le u < 2^{m+1}$. Then, since the most significant of the m + 1 bits in
u's representation is always 1, we can ignore it. As an example, assume m = 3, n = 4, and
a = 19 = 10011. Then 19 is represented as $18 = u \times 2^v$, where u = 9 = 1001 and v = 1. By
ignoring the first bit in the representation of u, the mantissa will store 001, while the exponent will
be 1. (b) On the other hand, if $a < 2^m$, the mantissa will contain a, while the exponent will be
$2^n - 1$. For example, for m = 3, n = 4, and a = 6 = 110, the mantissa is 110, while the exponent
is 1111. Converting from one format to another can be efficiently implemented. Figure 8.2 shows
the conversion code in C. For simplicity, we assume that integers are truncated rather than rounded
when represented in floating point.

³ This choice can also be seen as a useful exercise that forces us to aggressively encode the state in the scarce space
available in the IP header.

    void intToFP(int val, int *mantissa, int *exponent) {
        int nbits = get_num_bits(val);
        if (nbits <= m) {
            /* small value: store it verbatim and flag it with the reserved exponent */
            *mantissa = val;
            *exponent = (1 << n) - 1;
        } else {
            /* drop the low-order bits and the implicit leading 1 */
            *exponent = nbits - m - 1;
            *mantissa = (val >> *exponent) - (1 << m);
        }
    }

    int FPToInt(int mantissa, int exponent) {
        int tmp;

        if (exponent == ((1 << n) - 1))
            return mantissa;
        tmp = mantissa | (1 << m);    /* restore the implicit leading 1 */
        return (tmp << exponent);
    }

Figure 8.2: The C code for converting between integer and floating point formats. m represents the number
of bits used by the mantissa; n represents the number of bits in the exponent. Only positive values are
represented. The exponent is computed such that the first bit of the mantissa is always 1 when the number is
$\ge 2^m$. By omitting this bit, we gain an extra bit in precision. If the number is $< 2^m$ we set by convention the
exponent to $2^n - 1$ to indicate this.

By using m bits for the mantissa and n for the exponent, we can represent any integer in the range
$[0..(2^{m+1} - 1) \times 2^{2^n - 2}]$ with a relative error bounded by $(-1/2^{m+1}, 1/2^{m+1})$. For example,
with 7 bits, by allocating 3 for the mantissa and 4 for the exponent, we can represent any integer in the
range $[1..15 \times 2^{14}]$ with a relative error bounded by (-6.25%, 6.25%). Note that these bounds are
not necessarily tight. Indeed, in this example, the worst cases occur when encoding the numbers
271 and 273, which both have a mantissa of 8. In particular, 271 = 100001111 is encoded as
u = 1000, v = 5, and has a relative error of $(8 \times 2^5 - 271)/271 = -0.0554 = -5.54\%$, while
273 = 100010001 is encoded as u = 1001, v = 5, and has a relative error of $(9 \times 2^5 - 273)/273 = 5.49\%$.
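
The worked examples above can be checked with a small self-contained variant of the conversion routines. This is a sketch under stated assumptions: it fixes m = 3 and n = 4, uses truncation as in Figure 8.2, and supplies a hypothetical helper num_bits in place of get_num_bits, whose definition is not shown in the figure.

```c
enum { M_BITS = 3, N_BITS = 4 };    /* mantissa and exponent widths */

/* Number of significant bits in v (0 for v == 0). */
static int num_bits(unsigned v) {
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Encode val into an M_BITS mantissa and an N_BITS exponent (truncating). */
static void int_to_fp(unsigned val, unsigned *mant, unsigned *exp) {
    int nbits = num_bits(val);
    if (nbits <= M_BITS) {
        *mant = val;                        /* value fits: store verbatim */
        *exp = (1u << N_BITS) - 1;          /* reserved exponent marks this case */
    } else {
        *exp = (unsigned)(nbits - M_BITS - 1);
        *mant = (val >> *exp) - (1u << M_BITS);   /* drop implicit leading 1 */
    }
}

/* Decode a (mantissa, exponent) pair back into an integer. */
static unsigned fp_to_int(unsigned mant, unsigned exp) {
    if (exp == (1u << N_BITS) - 1)
        return mant;
    return (mant | (1u << M_BITS)) << exp;  /* restore implicit leading 1 */
}
```

With these settings, 19 encodes to mantissa 001 and exponent 1 and decodes to 18, 6 is stored verbatim with exponent 1111, and 271 truncates to 8 × 2^5 = 256, matching the -5.54% error computed above.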

If another value $b < a$ is carried by the packet, we store it as the fraction $f = b/a$. Assuming that
we use $m_1$ bits to represent $f$, the absolute error is bounded by $(-1/(2(2^{m_1} - 1)),\ 1/(2(2^{m_1} - 1)))$.
The -1 in the denominators is a result of mapping $2^{m_1}$ values to [0, 1], with $2^{m_1} - 1$ representing
1. Finally, it is easy to show that by representing a in floating point format with m bits for the mantissa
and n bits for the exponent, and by using $m_1$ bits to encode b, the relative error of a + b, denoted
RelErr(a + b), is bounded by

$$ -\frac{1}{2^{m+1}} - \frac{1}{2^{m_1+1} - 2} < \mathrm{RelErr}(a + b) < \frac{1}{2^{m+1}} + \frac{1}{2^{m_1+1} - 2} \qquad (8.1) $$

where we ignore the second order term $1/(2^{m+1}(2^{m_1+1} - 2))$.

Figure 8.3: For carrying DPS state we use the four bits from the TOS byte (or DS field) reserved for local
use and experimental purposes, and up to 13 bits from the IP fragment offset. (The diagram covers the ToS
byte (DS field) and the fragment offset (ip_off) field; its legend distinguishes bits used by IP forwarding or
by Diffserv, unused bits, and bits used by DPS.)

In  the  next  sections,  we  describe  in  detail  the  state  encoding  in  the  case  of  both  guaranteed  and 
flow  protection  services. 

8.2.3 State Encoding for Guaranteed Service

With  the  Guaranteed  service,  we  use  three  types  of  packets  to  carry  the  DPS  state:  (1)  data  packets, 
(2)  dummy  packets,  and  (3)  reservation  request  packets.  Below,  we  describe  the  state  encoding  in 
each  of  these  cases: 

8.2.3.1  Data  packet 

As described in Chapter 5, there are four pieces of state that need to be encoded in a data packet:
(1) the reserved rate r or, equivalently, l/r, where l is the packet length, (2) the slack delay δ, as
computed by Eq. (5.10), (3) the amount of time g by which the packet was transmitted ahead of
schedule at the previous node, and (4) b, as computed by Eq. (5.12). The first three pieces of state
are used to perform packet scheduling, while the last one is used to perform admission control. All
are positive values. Figure 8.4 shows the state encoding in this case. The first two bits represent the
code that identifies the state format. The remaining 15 bits are allocated to four fields:

• a 1-bit flag, called T, that specifies how the next field, F1, is interpreted

• a 3-bit field F1. If T = 0, then F1 encodes the b value. Otherwise, it encodes (l/r)/F3.

• a 4-bit field F2 that encodes g/F3

• a 7-bit field F3 that encodes l/r + δ.
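
The field layout above can be sketched as a simple bit-packing routine. This is a minimal illustration, not the implementation's code: it assumes the fields are laid out most-significant first in the order code, T, F1, F2, F3 within a 17-bit word, and the code value used in the example is a placeholder, since the actual value is given only in Figure 8.4. The names `DataState`, `pack`, and `unpack` are hypothetical.

```python
from typing import NamedTuple

# (field name, width in bits), most significant field first (assumed order).
FIELD_WIDTHS = (("code", 2), ("T", 1), ("F1", 3), ("F2", 4), ("F3", 7))


class DataState(NamedTuple):
    code: int  # 2-bit state format identifier
    T: int     # 1-bit flag selecting the interpretation of F1
    F1: int    # 3 bits: b, or (l/r)/F3
    F2: int    # 4 bits: g/F3
    F3: int    # 7 bits: l/r + delta, in floating point format


def pack(state: DataState) -> int:
    """Pack the five fields into one 17-bit integer."""
    word = 0
    for (name, width), value in zip(FIELD_WIDTHS, state):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> DataState:
    """Inverse of pack(): split a 17-bit integer back into its fields."""
    values = []
    for _, width in reversed(FIELD_WIDTHS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return DataState(*reversed(values))


if __name__ == "__main__":
    # 0b10 is a placeholder code, not the value used by the implementation.
    state = DataState(code=0b10, T=1, F1=5, F2=9, F3=100)
    word = pack(state)
    print(f"{word:017b}", unpack(word) == state)
```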

We make several observations. First, since F3 encodes the largest value among all fields, we
represent it in floating point format. By using this format, with seven bits we can represent any
positive number in the range [1..15 × 2^15], with a relative error within (-6.25%, 6.25%). Second,
since the deadline determines the delay guarantees, we use a representation that trades eligible time
accuracy for deadline accuracy.* In particular, the deadline is computed as d = current_time + F2 *
F3 + F3 ≈ current_time + g + l/r + δ. When the T flag indicates that F1 encodes (l/r)/F3, the eligible
time is computed as e = d - F1 * F3 ≈ current_time + g + δ. F1 uses only three bits, and its value is
computed such that F1 * F3 always overestimates l/r, so that e is never overestimated. When F1
instead carries the b value, the eligible time is computed simply as e = current_time. Third, we
express b in 1 KB units. In this way we eliminate the need for each packet to carry the b value. In
fact, if a flow sends at its reserved rate, only one packet in every eight packets needs to carry the
b value. This observation, combined with the fact that under-estimating the packet eligible
time does not affect the guaranteed delay of the flow, allows us to alternately encode either b or
(l/r)/F3 in F1, without impacting the correctness of our algorithms.

Figure 8.4: State encoding in the case of providing guaranteed service, and flow protection.
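
The deadline and eligible time computations described above can be sketched as follows. The sketch works on already decoded field values (F3 as a time, F1 and F2 as ratios relative to F3) and takes an explicit boolean saying whether F1 carries (l/r)/F3 or b. The function names and the example numbers are illustrative assumptions, not taken from the implementation.

```python
def deadline(current_time: float, f2: float, f3: float) -> float:
    """d = current_time + F2*F3 + F3, i.e. roughly current_time + g + l/r + delta."""
    return current_time + f2 * f3 + f3


def eligible_time(current_time: float, f1: float, f2: float, f3: float,
                  f1_is_ratio: bool) -> float:
    """Eligible time of a packet.

    f1_is_ratio -- True when F1 encodes (l/r)/F3; False when F1 carries b,
                   in which case the packet is treated as eligible immediately.
    """
    if not f1_is_ratio:
        return current_time
    # F1*F3 overestimates l/r, so e can only be under-estimated.
    return deadline(current_time, f2, f3) - f1 * f3


if __name__ == "__main__":
    # Example in milliseconds: l/r = 2, delta = 6, g = 1, so F3 = 8.
    now, f3 = 100.0, 8.0
    f2 = 1.0 / f3        # g / F3
    f1 = 2.0 / f3        # (l/r) / F3, exactly representable here
    print(deadline(now, f2, f3))
    print(eligible_time(now, f1, f2, f3, f1_is_ratio=True))
    print(eligible_time(now, 0.375, f2, f3, f1_is_ratio=True))  # F1 rounded up
```

Rounding F1 up, as in the last call, moves the eligible time earlier but leaves the deadline unchanged, which matches the trade-off described in the text.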

8.2.3.2 Dummy packet

As described in Section 5.4, if a flow does not send any data packet for a period of time larger
than Tj, the ingress router has to send a dummy packet. In practice, this can be either a packet
newly generated by the ingress, or a best-effort packet that happens to traverse the same
path. The state in the dummy packet has the code 1010000, and it carries only the b value in the F1
field. Note that using a long code leaves extra room for defining other state formats in the future.
Dummy packets are forwarded with a priority higher than that of any other traffic except
the guaranteed traffic.

* As long as the eligible time value is under-estimated, its inaccuracy will affect only the scheduling complexity, as the
packet may become eligible earlier.

8.2.3.3 Reservation request packet

This  packet  is  generated  by  an  ingress  router  upon  receiving  a  reservation  request.  The  ingress 
forwards this packet to the corresponding egress router in order to perform the requested reservation.
Upon  receiving  this  packet,  a  core  router  performs  admission  control,  and  updates  the  state  carried 
by  the  packet  accordingly.  When  an  egress  router  receives  a  reservation  request  packet,  it  sends  it 
back  to  the  ingress.  Upon  receiving  this  packet,  the  ingress  router  makes  the  final  decision.  Like  the 
dummy  packet,  the  reservation  request  packet  can  be  either  a  newly  generated  packet,  or  a  best-effort 
packet  that  happens  to  traverse  the  same  path. 

The  code  of  the  state  format  carried  by  this  packet  is  101.  The  packet  carries  four  fields  in  its 
header: 

•  a  1-bit  field,  D,  that  denotes  whether  the  packet  traverses  the  forward  or  the  backward  path. 
If  D  =  1,  the  packet  traverses  the  forwarding  path;  otherwise  the  packet  is  on  its  way  back  to 
the  ingress  router  that  generated  it. 

• a 1-bit field, A, which specifies the current state of the reservation request. If A = 1, then
all previous routers have accepted the reservation request; otherwise at least one router has
denied it.

• a 5-bit field, F1, that stores the number of hops, h, on the forwarding path. The field is
incremented by each core router along the path. The number of hops is required to compute
the slack variable of each packet at the ingress router (see Eq. (5.10)). Note that the current
implementation allows a path with at most 32 hops. While this value does not cover the
lengths of all end-to-end routes observed in today's Internet, it should be enough for
an ISP domain, which is the typical SCORE domain.

•  a  7-bit  field  that  stores  the  requested  rate.  This  rate  is  stored  in  floating  point  format,  with  a 
3-bit  mantissa  and  a  4-bit  exponent,  and  is  expressed  in  1  Kbps  units. 
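
To make the 7-bit rate format concrete, here is a minimal sketch of a codec with a 3-bit mantissa and a 4-bit exponent. Several details are assumptions made for illustration only: the mantissa has an implicit leading one, values are rounded to nearest, the mantissa occupies the high bits of the field, and only rates from 8 to 15 × 2^15 Kbps are handled. The encoding routine actually used by the implementation may differ (in particular in how it treats small values).

```python
MANT_BITS = 3
EXP_BITS = 4
LOWEST = 1 << MANT_BITS                                      # 8
HIGHEST = ((2 << MANT_BITS) - 1) << ((1 << EXP_BITS) - 1)    # 15 * 2**15


def encode_rate(kbps: int) -> int:
    """Encode an integer rate (in Kbps) into a 7-bit field: mantissa | exponent."""
    if not LOWEST <= kbps <= HIGHEST:
        raise ValueError("rate outside the range handled by this sketch")
    exp = kbps.bit_length() - (MANT_BITS + 1)
    # Round to nearest multiple of 2**exp; the quotient lies in 8..16.
    q = (kbps + ((1 << exp) >> 1)) >> exp
    if q == (2 << MANT_BITS):      # rounding carried into the next exponent
        q >>= 1
        exp += 1
    return ((q - LOWEST) << EXP_BITS) | exp


def decode_rate(field: int) -> int:
    """Inverse mapping: restore the implicit leading one and shift."""
    mant = field >> EXP_BITS
    exp = field & ((1 << EXP_BITS) - 1)
    return (LOWEST + mant) << exp


if __name__ == "__main__":
    for rate in (64, 1000, 1500, 100_000):
        approx = decode_rate(encode_rate(rate))
        print(rate, approx, f"{(approx - rate) / rate:+.2%}")
```

Under these assumptions the round-trip error stays within 1/2^(3+1) = 6.25% of the original value, which is consistent with the error range quoted for the 7-bit F3 field.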

8.2.4  State  Encoding  for  LIRA 

With  LIRA,  we  use  only  one  type  of  packet  to  carry  the  DPS  state  information.  The  first  two  bits  are 
always 01, and represent the code that identifies the state format. The remaining 15 bits are divided
into three fields as follows:

•  a  1-bit  flag,  P,  called  the  preferred  bit.  This  bit  is  set  by  the  application  or  user  and  indicates 
the  dropping  preference  of  the  packet.  This  bit  is  immutable,  i.e.,  it  cannot  be  modified  by 
any  router  in  the  network. 

• a 1-bit flag, M, called the marking bit. This bit is set by the ingress routers of an ISP and
indicates whether the packet is in-profile or out-of-profile. When a preferred packet arrives at an
ingress node, the node marks it if the user has not exceeded its profile; otherwise the packet is
left unmarked. Whenever there is congestion, a core router always drops unmarked packets
first. Irrespective of the congestion state, core routers never change the marking bit. The reason
we use two bits (i.e., the marking and the preferred bits) instead of one is that in an Internet
environment with multiple ISPs, even if a packet is out-of-profile in some ISPs on the
earlier portion of its path, it may still be in-profile in a subsequent ISP. Having a preferred
bit that is left unchanged by upstream ISPs on the path allows downstream ISPs to make the
correct decision.

• a 13-bit integer that represents the packet label. As discussed in Section 6.3.4, a label is
computed as the XOR over the identifiers of all routers along the flow's path. The length of a
router's identifier is equal to the label's length, i.e., 13 bits. The packet label is initialized by the
ingress router and updated by core routers as the packet is forwarded. Note that there is a
nonzero probability that two labels may collide, i.e., that two alternate paths between the same
routers have the same label. If router identifiers are uniformly distributed, the probability of
collision is 1/2^13 = 1/8192. While this probability cannot be neglected in practice, there
are two points worth noting. First, this problem can be alleviated by having the ISP choose
the router identifiers such that the probability of label collision is minimized. Second, note
that even if two alternate paths have the same label, the worst that can happen is that an
alternate path is ignored. Although this will eventually reduce the link utilization, it will
not jeopardize the correctness of our scheme.
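
The XOR-based label can be sketched in a few lines. The sketch below assumes 13-bit router identifiers drawn uniformly at random for the collision estimate; the function names and the simulation set-up are hypothetical and only illustrate the 1/8192 figure quoted above.

```python
import random
from functools import reduce

LABEL_BITS = 13
LABEL_MASK = (1 << LABEL_BITS) - 1


def path_label(router_ids) -> int:
    """Label of a path: XOR over the 13-bit identifiers of its routers."""
    return reduce(lambda label, rid: label ^ (rid & LABEL_MASK), router_ids, 0)


def fold_router(label: int, router_id: int) -> int:
    """Per-hop update. XOR is its own inverse, so the same operation adds a
    router to the label or removes it again."""
    return (label ^ router_id) & LABEL_MASK


def collision_rate(trials: int, hops: int = 4, seed: int = 0) -> float:
    """Fraction of random path pairs whose labels coincide (assumes uniformly
    distributed identifiers; illustrative only)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.getrandbits(LABEL_BITS) for _ in range(hops)]
        b = [rng.getrandbits(LABEL_BITS) for _ in range(hops)]
        hits += path_label(a) == path_label(b)
    return hits / trials


if __name__ == "__main__":
    ids = [0x1ABC, 0x0123, 0x1FFF]
    print(hex(path_label(ids)))
    print(collision_rate(200_000), 1 / 8192)
```

Because XOR is commutative, the label depends only on the set of routers on the path, which is why a per-hop update suffices and no router needs to know the whole path.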

8.2.5  State  Encoding  for  CSFQ 

As  discussed  in  Chapter  4,  with  CSFQ,  data  packets  carry  only  one  piece  of  state:  the  estimated  rate 
of  the  flow.  In  this  case,  we  define  a  code  of  010000,  and  a  10-bit  field  that  encodes  the  estimated 
rate,  by  using  a  floating  point  format  with  a  5-bit  mantissa,  and  a  5-bit  exponent.  The  estimated  rate 
is  initialized  by  the  ingress  router  and  then  updated  by  core  routers  whenever  the  estimated  rate  is 
greater  than  the  fair  rate  on  the  output  link  (see  Section  4.3). 

By  using  a  long  code  for  the  CSFQ  state,  we  leave  considerable  room  for  defining  future  state 
formats. 

8.2.6  State  Encoding  Formats  for  Future  Use 

By inspecting the state encoding formats in Figure 8.4, it is easy to see that there is significant
room for future extensions. In particular, any state encoding format that starts with 001, 111, 1011,
0001, 10101, 00001, 101001, 000001, 1010001, or 0000001 is available for future use. Thus, even by
restricting ourselves to reusing the space in the IP packet header, we can still design new DPS based
algorithms and mechanisms that (at least at the level of the state encoding) are fully compatible with
the solutions proposed in this dissertation, and that can use up to 14 bits for specific state encoding.
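
As a small check on the numbers above, the following sketch verifies that the listed free codes do not shadow one another and computes how many bits each leaves for state. It assumes a 17-bit encoding space, inferred from the 2-bit code plus 15 remaining bits described earlier; the helper names are hypothetical.

```python
TOTAL_BITS = 17  # assumption: 2-bit code + 15 remaining bits

FREE_CODES = [
    "001", "111", "1011", "0001", "10101",
    "00001", "101001", "000001", "1010001", "0000001",
]


def is_prefix_free(codes) -> bool:
    """True if no code is a proper prefix of another, so a format can be
    identified unambiguously by reading bits from the left."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def payload_bits(code: str, total: int = TOTAL_BITS) -> int:
    """Bits left for state once the identifying code has been spent."""
    return total - len(code)


if __name__ == "__main__":
    print(is_prefix_free(FREE_CODES))
    for code in FREE_CODES:
        print(code, payload_bits(code))
```

The shortest free codes are three bits long, which leaves 17 - 3 = 14 bits and agrees with the figure stated in the text.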


8.3  System  Monitoring 

When implementing a network service, a key challenge is discovering whether the service is working
or not, and, more importantly, if it doesn't work, finding out why. To help answer these questions
we need a monitoring tool that can accurately expose router behavior. Ideally, we want the ability
to monitor a router's behavior in real time, and at the smallest possible granularity, i.e., at the packet
level. A key challenge when implementing such fine grained monitoring functionality is to minimize
the interference with the monitored system.

Figure 8.5: The information logged for each packet that is monitored. In addition to most fields of the IP
header, the router records the arrival and departure times of the packet (i.e., the ones that are shaded). The
type of event is also recorded, e.g., packet departure, or packet drop. The DPS state is carried in the ToS and
the fragment offset fields.

A possible solution that fully addresses this challenge is to use a hub or a similar device before/after
each router's interface to replicate the entire traffic and divert it to an external monitoring
machine where it can be processed. Unfortunately, there are two problems with this solution. First,
it is very hard to accurately measure the arrival and the departure times of a packet. With a simple
hub we can infer these times only when packets arrive at the monitoring machine. However, the
packet processing overhead at the monitoring machine can significantly impact the accuracy of the
time estimates. Even with a special device that can insert a timestamp in each packet when the
packet is replicated, the problem is not trivial. This is because the arrival and departure times of
a packet are inserted by two different devices, so we need a very accurate clock synchronization
mechanism to minimize the errors. The second problem with this solution is that it requires an
expensive hardware infrastructure.

For these reasons, in our implementation we use an alternate approach. In particular, we instrument
the kernel to perform packet level monitoring. For each packet, we record the arrival and
the departure times with very high accuracy using the Pentium clock counter. In addition to these
times, with each packet we log most of the fields of the IP header, including the DPS state carried
by the packet, and a field, called log type, which specifies whether the packet was transmitted or
dropped. To minimize the monitoring overhead, we use the ip_output function call to send this
information directly from the kernel to an external monitoring machine. Using ip_output, instead of
ioctl, for transferring log data avoids unnecessary context switching between the kernel and the
user level. In addition, we off-load as much of the log data processing as possible to the monitoring
machine. The functions performed by a router are kept to a minimum: a router has only to (1) log
packet arrival and departure times, and (2) send the log data to the monitoring machine. In turn, the
monitoring machine performs the bulk of the data processing, such as packet classification, and
computing the rate and/or delay of a flow. To further reduce the monitoring overhead, we provide the
ability to sample the data traffic. In particular, we can configure the router to probabilistically log
one out of every 2, 4, 16, or 64 packets. Finally, to eliminate possible interference between the
data traffic and the transfer of the log data, we use a different network interface to send out the log
information.
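
The probabilistic sampling step can be sketched as an independent draw per packet. This is only an illustration of the idea in user-level Python, under the assumption that each packet is logged with probability 1/N; the kernel code may select packets differently. The names `should_log` and `sample` are hypothetical.

```python
import random


def should_log(interval: int, rng: random.Random) -> bool:
    """Return True with probability 1/interval, so that on average one out of
    every `interval` packets is logged."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return rng.randrange(interval) == 0


def sample(packets, interval: int, seed: int = 0):
    """Select the packets that would be logged at the given sampling interval."""
    rng = random.Random(seed)
    return [p for p in packets if should_log(interval, rng)]


if __name__ == "__main__":
    packets = range(64_000)
    for interval in (2, 4, 16, 64):
        logged = sample(packets, interval)
        print(interval, len(logged), len(packets) / interval)
```

Sampling independently per packet, rather than logging exactly every N-th packet, avoids locking onto periodic traffic patterns, at the cost of some variance in the number of logged packets.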

To  visualize  the  monitoring  information,  we  have  developed  a  Graphical  User  Interface  (GUI) 
written  in  Java.  This  tool  offers  the  possibility  of  simultaneously  plotting  up  to  nine  graphs  in  real 
time.  Each  of  the  graphs  can  be  independently  configured.  In  particular,  the  monitoring  tool  allows 
us  to  specify  for  each  graph: 

•  the  network  interface  to  be  monitored.  This  is  specified  by  the  IP  address  of  the  router  to 
which  the  interface  is  attached,  and  the  IP  address  of  the  neighboring  router  that  is  connected 
to  the  interface. 

•  the  flows  to  be  monitored  at  the  selected  network  interface.  In  our  case,  a  flow  is  identified 
by  its  source  and  destination  IP  addresses,  its  source  and  destination  port  numbers,  and  the 
protocol  type. 

• the parameters to monitor. Currently the tool can plot the throughput and the delay of a flow.
The throughput is averaged over a specified time period. The delay can be computed
as the minimum, maximum, or average packet delay over a specified period of time.

This  flexibility  allows  us,  for  example,  to  simultaneously  visualize  the  throughput  and  per-hop 
delay  of  a  flow  at  different  hops,  to  visualize  the  throughput  and  the  delay  of  a  flow  at  the  same  hop, 
or to visualize the throughput of a flow at multiple time scales. As an example, Figure 3.5 shows a
screen snapshot of our monitoring tool. The top two plots show the throughputs of three flows at two
consecutive hops. The bottom two plots show the delays of the same flows at the same hops.
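
The per-flow quantities plotted by the tool can be derived from the logged records along the following lines. This is a minimal sketch of the processing done on the monitoring machine, assuming a simplified record with arrival time, departure time, packet length, and a drop flag; the record layout and function names are hypothetical and much simpler than the log format of Figure 8.5.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    arrival: float      # seconds
    departure: float    # seconds
    length: int         # bytes
    dropped: bool = False


def _sent_in_window(records, t_start, t_end):
    """Records of packets transmitted (not dropped) within [t_start, t_end)."""
    return [r for r in records
            if not r.dropped and t_start <= r.departure < t_end]


def throughput_bps(records, t_start: float, t_end: float) -> float:
    """Average throughput, in bits per second, over the given period."""
    bits = sum(8 * r.length for r in _sent_in_window(records, t_start, t_end))
    return bits / (t_end - t_start)


def delay_stats(records, t_start: float, t_end: float):
    """(min, max, average) per-hop delay over the period, or None if no
    packet was transmitted in it."""
    delays = [r.departure - r.arrival
              for r in _sent_in_window(records, t_start, t_end)]
    if not delays:
        return None
    return min(delays), max(delays), sum(delays) / len(delays)


if __name__ == "__main__":
    log = [
        LogRecord(0.0, 0.5, 1000),
        LogRecord(1.0, 1.25, 1000),
        LogRecord(1.5, 1.5, 1000, dropped=True),
    ]
    print(throughput_bps(log, 0.0, 2.0))
    print(delay_stats(log, 0.0, 2.0))
```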


8.4  System  Configuration 

As  with  any  software  system,  we  need  the  ability  to  configure  our  system.  In  particular,  we  want 
the  ability  to  configure  a  router  as  an  ingress,  egress,  or  core,  and  to  set-up  the  various  parameters 
of  the  packet  classifier  and  packet  scheduler.  In  addition,  in  the  case  of  the  guaranteed  service,  we 
want  the  ability  to  set-up,  modify,  and  tear-down  a  flow  reservation.  One  possible  solution  would  be 
to  use  an  ioctl  function  call.  Unfortunately,  this  approach  has  a  significant  drawback.  Since  the 
execution  of  ioctl  is  quite  expensive  -  it  requires  context-switching  between  the  kernel  and  the 
user  level  -  it  can  seriously  interfere  with  the  processing  of  the  data  traffic.  At  the  limit,  in  the  case 
of  the  guaranteed  traffic,  this  can  result  in  packets  missing  their  deadlines. 

To avoid this problem, we use the Internet Control Message Protocol (ICMP). In particular,
we have written a simple command line utility, called config_dps, that can be used to configure
routers and initialize their parameters via ICMP. To achieve this, we have extended the ICMP
protocol by adding two new messages: icmp_dpsreq and icmp_dpsreply.

8.4.1 Router Configuration

So  far  we  have  classified  routers  as  edge  and  core  routers,  and  the  edge  routers  as  ingress  and  egress 
routers.  While  at  the  conceptual  level  this  classification  makes  sense,  in  practice,  this  distinction  is 
less  clear.  For  example,  an  edge  router  can  be  either  an  ingress  or  egress  depending  on  the  direction 
of  the  traffic  that  traverses  it.  As  a  result,  we  need  to  configure  at  a  finer  granularity,  i.e.,  at  the 
network  interface  level,  instead  of  at  the  router  level.  We  identify  an  interface  by  two  IP  addresses: 
the address of the router to which the interface is attached, and the address of the downstream router
connected  to  the  interface.  We  then  use  the  following  command  to  configure  a  network  interface: 

config_dps 2 node next_node type

The  first  parameter  of  the  command,  the  number  2,  specifies  that  an  interface  is  being  configured. 
The  next  two  parameters  specify  the  interface:  node  represents  the  IP  address  of  the  router  whose 
interface is being configured, and next_node specifies the IP address of the downstream router
that  is  connected  to  the  interface.  Finally  type  specifies  whether  this  interface  is  configured  as 
belonging  to  an  ingress,  egress,  or  core  router. 

8.4.2  Flow  Reservation 

In the case of the guaranteed service we need the ability to set-up, modify, and tear-down a flow
reservation. 

To  set-up  a  new  reservation  we  use  the  following  command: 

config_dps 3 ingress egress src_addr dst_addr src_port dst_port
proto rsv_rate q_size

The  first  parameter,  the  number  3,  specifies  that  this  command  requests  a  new  reservation.  The  next 
two  parameters  specify  the  ingress  and  egress  routers  which  are  the  end-points  of  our  reservation.  In 
addition, we use these parameters to identify the interface of the ingress where the flow state is to
be instantiated. The next five parameters, src_addr, dst_addr, src_port, dst_port, and proto,
respectively,  identify  the  flow.  Finally,  the  last  two  parameters  specify  the  requested  reservation  (in 
Kbps),  and  the  buffer  size  to  be  allocated  at  the  ingress  for  this  flow  (in  packets). 

To  terminate  a  reservation,  we  use  the  following  command: 

config_dps 4 ingress egress src_addr dst_addr src_port dst_port

The first parameter, the number 4, represents the code of the deletion operation. The other
parameters have the same meaning as in the previous command.

Finally,  the  command  to  update  a  reservation  is  virtually  identical  to  setting-up  a  reservation.  In 
fact,  this  operation  is  implemented  by  tearing  down  the  existing  reservation,  and  creating  a  new  one. 

8.4.3  Monitoring 

As described in Section 8.3, our implementation provides support for fine grained monitoring. To
start monitoring the traffic at a network interface, we use the following command:

config_dps 0 node next_node mon_node mon_port level

Again, the first parameter, the number 0 in this case, represents the operation code. The next two
parameters are used to identify the interface that we want to monitor, while mon_node and mon_port
represent the IP address of the monitoring machine and the port number to which the log data is to
be transmitted. Finally, level specifies the sampling level: the router probabilistically logs one
packet out of each group of consecutive packets that arrive at the interface, with the group size
(2, 4, 16, or 64 packets; see Section 8.3) selected by level. To stop monitoring the traffic at
an interface, we use the same command, but with the first parameter (the operation code) set to 1,
instead of 0.
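
The operation codes used by the commands in this section (0 and 1 for starting and stopping monitoring, 2 for interface configuration, 3 and 4 for setting up and terminating a reservation) can be collected in a small helper that builds the argument list for each invocation. The sketch below does not run config_dps; the operation names, the example addresses (taken from the documentation range 192.0.2.0/24), and the literal value "core" for the type parameter are all hypothetical, since the accepted spellings are not given in the text.

```python
# Operation codes as described in Sections 8.4.1 to 8.4.3; the names are ours.
OPERATIONS = {
    "monitor_start": 0,
    "monitor_stop": 1,
    "configure_interface": 2,
    "reserve": 3,
    "terminate": 4,
}


def config_dps_args(operation: str, *params) -> list:
    """Build the argument vector for one config_dps invocation."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    return ["config_dps", str(OPERATIONS[operation]), *map(str, params)]


if __name__ == "__main__":
    # Configure the interface from 192.0.2.1 towards 192.0.2.2.
    print(" ".join(config_dps_args("configure_interface",
                                   "192.0.2.1", "192.0.2.2", "core")))
    # Reserve 512 Kbps for a UDP flow, with a 32-packet ingress buffer.
    print(" ".join(config_dps_args("reserve", "192.0.2.1", "192.0.2.9",
                                   "198.51.100.5", "203.0.113.7",
                                   5000, 6000, "udp", 512, 32)))
    # Start monitoring that interface, logging to 192.0.2.50 port 9000.
    print(" ".join(config_dps_args("monitor_start", "192.0.2.1", "192.0.2.2",
                                   "192.0.2.50", 9000, 2)))
```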


8.5  Summary 

In this chapter, we have presented a prototype implementation of our solution to provide guaranteed
and flow protection services in an IPv4 network. To the best of our knowledge this is the first
implementation that provides these services in a core stateless network architecture. We have described
the operations performed by both edge and core routers on the data and control paths, as well as an
approach that allows the DPS state to be encoded efficiently in the IP header by re-using some of the
existing fields. In addition, we have described two support tools that allow packet level monitoring
in real time, and easy system configuration.

We have implemented our solution in FreeBSD 2.2.6, and deployed it in a local test-bed that
consists of four PC routers and up to 16 hosts. The results presented in Sections 3.2 and 5.5 show
that (1) the overhead introduced by our solution is acceptable, i.e., we can still saturate the 100
Mbps links, (2) the overhead increases very little with the number of flows (at least when the number of
flows is no larger than 100), and (3) the scheduling mechanisms protect the guaranteed traffic so
that its performance is not affected by the best-effort traffic.

While implementing this prototype we have learned two valuable lessons. First, the monitoring
tool proved very useful not only in debugging the system, but also in promptly finding unexpected
causes of experiment failures. For example, in more than one instance, we were able to easily
discover that we did not achieve the expected end-to-end performance simply because the source
did not correctly generate the traffic due to unexpected interference (such as someone inadvertently
using the same machine). Second, the fact that our solution does not require the maintenance of
distributed per flow state has significantly simplified our implementation. In particular, we were able
to implement the entire functionality of the control path, which is notoriously difficult to implement
and debug in the case of stateful solutions such as RSVP [128], in just a couple of days.

However, our implementation suffers from some limitations that we hope to address in the future:

•  The  current  test-bed  is  a  local  area  network.  We  would  like  to  perform  similar  tests  in  a  larger 
internetwork  environment  such  as  CAIRN  [16]. 

•  All  the  experiments  were  performed  with  synthetic  loads.  In  the  future,  we  would  like  to 
experiment  with  real  applications  such  as  video  conferencing  or  distributed  simulations. 

• The test-bed consists of only PC workstations. We would like to implement our algorithms in
commercial routers.

• The current configuration tool does not offer any protection. Anyone who knows the addresses
of our routers and the command format can send ICMP messages to re-configure the
routers. Providing an encryption mechanism to avoid malicious router re-configuration would
be useful.


Chapter 9


Conclusions  and  Future  Work 


In  this  chapter,  we  conclude  the  dissertation  by  (1)  summarizing  our  contributions,  (2)  exposing 
some  fundamental  limitations  of  current  solutions,  and  (3)  proposing  several  directions  for  future 
work. 


9.1  Contributions 

One  of  the  most  important  reasons  behind  the  overwhelming  success  of  the  Internet  is  the  stateless 
nature  of  its  architecture.  The  fact  that  routers  do  not  need  to  maintain  any  fine  grained  information 
about  traffic  makes  the  Internet  both  scalable  and  robust.  However,  these  advantages  come  at  a 
price:  today’s  Internet  provides  only  a  minimalist  service,  the  best  effort  datagram  delivery.  As  the 
Internet  evolves  into  a  global  communication  infrastructure  that  is  expected  to  support  a  plethora 
of  new  applications  such  as  IP  telephony,  interactive  TV,  and  e-commerce,  the  existing  best  effort 
service  will  no  longer  be  sufficient.  As  a  result,  there  is  an  urgent  need  to  provide  more  powerful 
services  such  as  guaranteed  services  and  flow  protection. 

Over  the  past  decade,  there  has  been  intense  research  toward  achieving  this  goal.  Two  classes  of 
solutions have been proposed: those maintaining the stateless property of the original Internet (e.g.,
Differentiated Services), and those requiring a new stateful architecture (e.g., Integrated Services).
While stateful solutions can provide more powerful and flexible services such as per flow guaranteed
services,  and  can  achieve  higher  resource  utilization,  they  are  less  scalable  than  stateless  solutions. 
In  particular,  stateful  solutions  require  each  router  to  maintain  and  manage  per  flow  state  on  the 
control  path,  and  to  perform  per  flow  classification,  scheduling,  and  buffer  management  on  the  data 
path.  Since  there  can  be  a  large  number  of  active  flows  in  the  Internet,  it  is  difficult,  if  not  impossible, 
to  implement  such  solutions  in  a  scalable  fashion.  On  the  other  hand,  while  stateless  solutions  are 
much  more  scalable,  they  offer  weaker  services. 

The  main  contribution  of  this  dissertation  is  to  bridge  the  long-standing  gap  between  stateful 
and  stateless  solutions.  To  achieve  this  goal,  we  have  described  a  novel  technique  called  Dynamic 
Packet  State  (DPS).  The  key  idea  behind  DPS  is  that,  instead  of  having  routers  maintain  per  flow 
state,  packets  carry  the  state.  In  this  way,  routers  are  still  able  to  process  packets  on  a  per  flow 
basis,  despite  the  fact  that  they  do  not  maintain  per  flow  state.  Based  on  DPS,  we  have  proposed  a 
network  architecture  called  Stateless  Core  (SCORE)  in  which  core  routers  do  not  maintain  any  per 
flow state. Yet, by using DPS, we have demonstrated that, in a SCORE network, it is possible to
provide services which are as powerful and flexible as the services provided by a stateful network.
In  particular,  we  have  developed  complete  solutions  to  address  some  of  the  most  important  problems 
in  today’s  Internet: 

•  Flow  protection  (Chapter  4)  We  have  proposed  the  first  solution  to  provide  protection  on  a 
per  flow  basis  without  requiring  core  routers  to  maintain  any  per  flow  state.  To  achieve  this 
goal,  we  have  used  DPS  to  approximate  the  functionality  of  a  reference  network  in  which 
every router implements the Fair Queueing [31] discipline with a SCORE network in which
every  router  implements  a  novel  algorithm,  called  Core-Stateless  Fair  Queueing  (CSFQ). 

• Guaranteed services (Chapter 5) We have developed the first solution to provide per flow
delay and bandwidth guarantees without requiring core routers to maintain per flow state. We
have achieved this goal by using DPS to emulate, in a SCORE network, the functionality of a
stateful reference network in which each router implements Jitter Virtual Clock [126] on the
data path, and per flow admission control on the control path. To this end, we have proposed
a novel scheduling algorithm, called Core-Jitter
Virtual Clock (CJVC), that provides the same end-to-end delay bounds as Jitter Virtual Clock,
but, unlike Jitter Virtual Clock, does not require routers to maintain per flow state.

•  Large  spatial  granularity  service  (Chapter  6)  We  have  developed  a  stateless  solution  that 
allows  us  to  provide  relative  differentiation  between  traffic  aggregates  over  large  numbers 
of  destinations.  The  most  complex  mechanism  required  to  implement  this  service  is  route- 
pinning,  for  which  traditional  solutions  require  routers  to  either  maintain  per  flow  state,  or 
maintain  state  that  is  proportional  to  the  square  of  the  number  of  edge  routers.  By  using  DPS, 
we  are  able  to  significantly  reduce  this  complexity.  In  particular,  we  propose  a  route-pinning 
mechanism  that  requires  routers  to  maintain  state  which  is  proportional  only  to  the  number  of 
egress  routers. 

While the above solutions have many scalability and robustness advantages over existing stateful
solutions, they still suffer from robustness and scalability limitations in comparison with stateless solutions. System
robustness  is  limited  by  the  possibility  that  a  single  edge  or  core  router  may  malfunction,  inserting 
erroneous  information  in  the  packet  headers,  severely  impacting  performance  of  the  entire  network. 
In  Chapter  7,  we  propose  an  approach,  called  “verify-and-protect”,  that  overcomes  these  limitations. 
We  achieve  scalability  by  pushing  the  complexity  all  the  way  to  the  end-hosts,  thus  eliminating  the 
distinction  between  edge  and  core  routers.  To  address  the  trust  and  robustness  issues,  all  routers 
statistically  verify  that  the  incoming  packets  are  correctly  marked.  This  approach  enables  routers  to 
discover  and  isolate  misbehaving  end-hosts  and  routers. 

To  demonstrate  the  compatibility  of  our  solutions  with  the  existing  protocols,  we  have  presented 
the  design  and  prototype  implementation  of  the  guaranteed  service  in  IPv4  networks.  In  Chapter  8 
we propose both efficient state encoding algorithms and an encoding format for the proposed
solutions.

The SCORE/DPS ideas have already made an impact in both the research and industrial communities.
Since we published the first papers [101, 104], several new and interesting results
have been reported. They include both extensions and improvements to the original CSFQ
algorithm [18, 25, 111], and generalizations of our solution to provide guaranteed services [129, 130].
In addition, DPS-like techniques have been used to develop new types of applications such as IP
traceback [91]. Furthermore, we observe that it is possible to extend the current Differentiated Service
framework [32] to accommodate algorithms using Dynamic Packet State. The key extension
needed is to augment each Per Hop Behavior (PHB) with additional space in the packet header for
storing PHB specific Dynamic Packet State [107]. Such a paradigm will significantly increase the
flexibility and capabilities of the services that can be built with a Diffserv-like architecture.


9.2  Limitations 

While  in  this  thesis  we  have  shown  that  by  using  the  DPS  technique  it  is  possible  to  implement  some 
of  the  most  representative  Internet  services  (for  which  previous  solutions  required  stateful  networks) 
in  a  SCORE  network,  one  important  question  still  remains:  What  are  the  limitations  of  SCORE/DPS 
based  solutions?  More  precisely,  is  there  any  service  implemented  by  stateful  networks  that  cannot 
be  implemented  by  a  SCORE  network?  In  this  section,  we  informally  answer  these  questions. 

In a stateful router, each flow is associated with a set of state variables such as the length of the
flow's queue and the deadline of the packet at the head of the flow's queue. In addition, a router
maintains some global state variables such as buffer occupancy or utilization on the output link. A
router processes a packet based on both the per flow state and the global state stored at the router.
As an example, upon a packet arrival, a router checks whether the buffer is full, and if so, discards
the packet at the tail of the longest queue to make room for the new packet.
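
As an illustration of this example, the following minimal Python sketch models a router that keeps one queue per flow (per flow state) together with a buffer occupancy counter (global state). The class name `StatefulRouter`, the method `enqueue`, and the packet labels are hypothetical and chosen only for illustration; a buffer size of at least one packet is assumed.

```python
from collections import deque

class StatefulRouter:
    """Toy stateful router: one FIFO queue per flow plus a global buffer count."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size   # global state: packets the buffer can hold (>= 1)
        self.occupancy = 0               # global state: packets currently buffered
        self.queues = {}                 # per flow state: flow id -> queue of packets

    def enqueue(self, flow_id, packet):
        if self.occupancy >= self.buffer_size:
            # Buffer full: discard the packet at the tail of the longest queue.
            longest = max(self.queues, key=lambda f: len(self.queues[f]))
            self.queues[longest].pop()
            self.occupancy -= 1
        self.queues.setdefault(flow_id, deque()).append(packet)
        self.occupancy += 1

router = StatefulRouter(buffer_size=3)
for flow_id, packet in [("a", "a1"), ("a", "a2"), ("b", "b1"), ("c", "c1")]:
    router.enqueue(flow_id, packet)
```

In this run the fourth arrival finds the buffer full, so the tail packet of flow "a" (the longest queue) is discarded. Both decisions, which queue is longest and whether the buffer is full, rely on state stored at the router.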

In contrast, in SCORE, a core router processes packets based on the state carried in packet
headers, instead of per flow state (as these routers do not maintain any such state). Thus, in
order to emulate the functionality of a stateful router, a stateless router has to reconstruct the per
flow state from the state carried in the packet headers. The question of what the limitations of
SCORE/DPS based solutions are then reduces to the question of what types of per flow state cannot
be exactly reconstructed by core routers. We are aware of two instances in which current techniques
cannot reconstruct the per flow state accurately:

1.  The  state  of  a  flow  depends  on  the  behavior  of  other  competing  flows.  Intuitively,  this  is 
because  it  is  very  hard,  if  not  impossible,  to  encode  the  effects  of  this  dependence  in  the  packet 
headers.  Indeed,  this  would  require  a  router  to  know  the  future  behavior  of  the  competing 
flows  at  the  next  router  before  updating  the  state  in  the  packet  headers. 

Consider the problem of exactly emulating the Fair Queueing (FQ) discipline. Recall that
FQ is a packet level realization of a bit-by-bit round robin: if there are $n$ backlogged flows,
FQ allocates $1/n$ of the link capacity to each flow. In the packet system, a flow is said to be
backlogged if it has at least one packet in the queue.

The  challenge  of  implementing  FQ  in  a  stateless  router  is  that  the  number  of  backlogged 
flows,  and  therefore  the  service  received  by  a  flow,  is  a  highly  dynamic  parameter,  i.e.,  it 
can  change  every  time  a  new  packet  arrives  or  departs.  Unfortunately,  it  is  very  hard  if  not 
impossible  for  a  router  to  accurately  update  the  number  of  backlogged  flows  when  such  an 
event  occurs.  To  illustrate  this  point,  consider  a  packet  arrival  event. 

When a packet of an idle flow arrives, the flow becomes backlogged, and the number of
backlogged flows increases by one. Because the router does not maintain per flow state, the only
way it can infer this information is from the state carried in the packet header (as this is the
only new information that enters the system). However, inserting the correct information in
the packet header would require up-stream routers to know how many packets are in the flow's
queue when the packet arrives at the current node. In turn, this would require knowledge of
how much service the flow has received at the current node so far, as this determines how
many packets are still left in the queue. Unfortunately, even if we were using a feedback
protocol to continuously inform the up-stream routers of the state of the current router, the
propagation delay poses fundamental limitations on the accuracy of this information. For
example, during the time it takes the information to reach the up-stream routers, an arbitrary
number of flows may become backlogged!

2.  The  state  of  a  flow  depends  on  parameters  that  are  unique  to  each  router.  Intuitively,  this  is 
because  it  is  very  hard  to  reconstruct  these  parameters  at  each  router  given  the  limited  space 
in  the  packet  headers. 

Consider a router that implements the Weighted Fair Queueing (WFQ) scheduling discipline.
Similar to FQ, WFQ is a realization of a weighted bit-by-bit round robin: a flow $i$ with weight
$w_i$ receives $w_i / \sum_{j \in B} w_j$ of the link capacity, where $B$ is the set of backlogged flows.
Assume that the flow has a different weight at every router along its path. Then, each packet
has to encode all these weights in its header. Unfortunately, in the worst case, encoding
these weights requires an amount of state proportional to the length of the path, which is not
acceptable in practice.
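
The two limitations above can be made concrete with a small Python sketch of the share that an idealized WFQ server gives to a flow. The function name `wfq_share`, the flow labels, and all numeric weights are illustrative assumptions, not values from this dissertation.

```python
def wfq_share(flow, weights, backlogged, capacity):
    """Bandwidth given to `flow` by an idealized WFQ server.

    weights    -- dict: flow id -> weight at this router (hypothetical values)
    backlogged -- set of flow ids that currently have packets queued
    capacity   -- link capacity, in any rate unit
    """
    if flow not in backlogged:
        return 0.0
    total = sum(weights[f] for f in backlogged)
    return capacity * weights[flow] / total

weights = {"a": 1.0, "b": 1.0, "c": 2.0}

# First limitation: the share of flow "a" depends on which other flows are backlogged.
share_all = wfq_share("a", weights, {"a", "b", "c"}, 10.0)   # weight 1 out of 4
share_two = wfq_share("a", weights, {"a", "b"}, 10.0)        # weight 1 out of 2

# Second limitation: if flow "a" had a different weight at every router,
# each packet would have to carry one weight per hop of its path.
per_hop_weights = [1.0, 3.0, 0.5, 2.0]
header_fields_needed = len(per_hop_weights)
```

The share of flow "a" doubles as soon as flow "c" stops being backlogged, an event that no up-stream router can anticipate, and the number of header fields grows with the path length.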

Not surprisingly, these limitations are reflected in our solutions. The first limitation is the main
reason why Core Stateless Fair Queueing is only able to approximate, not emulate, the Fair
Queueing discipline. Similarly, it is the main reason why our per flow admission control solution uses
an upper bound of the aggregate reservation, instead of the actual aggregate reservation (see
Section 5.4). Finally, our decision to use a non-work conserving discipline such as Core-Jitter Virtual
Clock to implement guaranteed services is precisely because of this limitation. In particular, the
fact that the service received by a flow under a non-work conserving discipline is not affected by
the behavior of the competing flows allows us to break the dependence between the flow state and
the behavior of the competing traffic. This makes it possible to compute the eligible times and the
deadlines of a packet at all core routers, as soon as the packet arrives at the ingress router. The
potential downside of a non-work conserving discipline is the inability of a flow to use additional
resources made available by inactive flows.

As  a  result  of  the  second  limitation,  we  only  consider  the  cases  in  which  a  flow  has  the  same 
reserved  bandwidth  or  weight  at  all  routers  along  its  path  through  a  SCORE  domain.  In  the  case  of 
providing  guaranteed  services,  this  restriction  can  lead  to  lower  resource  utilization,  as  it  is  difficult 
to  efficiently  match  the  available  resources  at  each  router  with  the  flow  requirements. 

In  summary,  as  a  result  of  these  limitations,  our  SCORE/DPS  based  solutions  cannot  exactly 
match  the  performance  of  traditional  stateful  solutions.  In  spite  of  this,  as  we  have  demonstrated 
by  analysis  and  experimental  results,  our  solutions  are  powerful  enough  to  fully  implement  some  of 
the most popular per flow network services.

9.3 Future Work


In the next sections, we identify several research directions for future work. Whenever possible, we
try to emphasize the main difficulties and possible solutions to address the proposed problems.

9.3.1  Decoupling  Bandwidth  and  Delay  Allocations 

As described in Chapter 5, our solution to provide guaranteed services associates a single parameter
with each flow: the flow's reserved rate. While we can provide both per flow delay and bandwidth
guarantees by appropriately setting the flow reserved rates, the fact that we are restricted to only one
parameter may lead to inefficient resource utilization.

To illustrate this point, consider a flow that sends traffic at a constant rate $r$, and has fixed size
packets of length $l$. In this case, the worst case delay experienced by a packet at one router is about¹
$l/R$, where $R$ is the flow's bandwidth reservation. Intuitively, this is because in an idealized model
in which the flow traverses only dedicated links of capacity $R$, it takes a router exactly $l/R$ time
to transmit a packet of size $l$. Assume that the flow requests a per hop delay no larger than $D$. To
meet this requirement, the router has to allocate a bandwidth $R$ such that $l/R \le D$, or alternatively
$R \ge l/D$. In addition, $R$ should be no smaller than the arrival rate of the flow, $r$. As a result, for a
flow with an arrival rate of $r$ and a per hop delay bound of $D$, a router has to allocate a bandwidth
of at least $R = \max(l/D, r)$. Consider a 64 Kbps audio flow that uses 1280 bit packets, and has
a per hop delay budget of $D = 5$ ms. To meet this delay requirement, the flow needs to reserve at
least $R = l/D = 1280$ bits / 5 ms = 256 Kbps, which is four times the flow's rate! Thus,
using only one parameter can result in serious resource underutilization.
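
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the single-parameter rule stated above (reserve the larger of packet length over delay bound and the arrival rate); the function name `min_reservation_bps` and the variable names are our own illustrative choices.

```python
def min_reservation_bps(packet_bits, delay_bound_s, arrival_rate_bps):
    """Smallest single-rate reservation: the larger of l / D and r."""
    return max(packet_bits / delay_bound_s, arrival_rate_bps)

rate_bps = 64_000      # 64 Kbps audio flow
packet_bits = 1280     # fixed packet length
delay_s = 0.005        # 5 ms per hop delay budget

reserved_bps = min_reservation_bps(packet_bits, delay_s, rate_bps)
ratio = reserved_bps / rate_bps    # how much larger the reservation is than the flow's rate
print(reserved_bps, ratio)
```

With a looser delay budget the delay term stops dominating: for instance, with a 50 ms budget the same function returns the arrival rate itself, and no bandwidth is over-reserved.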

In the stateful world, several solutions have been proposed to address this problem [90, 106,
124]. A future direction would be to emulate these solutions in the SCORE/DPS framework. The
problem is that current solutions to decouple bandwidth and delay allocations use at least two
parameters to specify a flow reservation. This significantly complicates both the data and the control
path implementations. For example, admission control requires checking whether a two-piece linear
function representing the new reservation ever exceeds an $n$-piece linear function representing the
available link resources (capacity), where, in the worst case, $n$ represents the number of flows [90].
Thus, storing the representation of the available capacity requires an amount of state proportional to
the number of flows, which is unacceptable for a stateless solution. A possible approach to alleviate
this problem would be to restrict the values taken by the parameters that characterize flow
reservations. The challenge is doing this without compromising the flexibility offered by decoupling the
bandwidth and delay allocations.

¹ More precisely, the worst case delay is $l/R + l_{\max}/C$, where $l_{\max}$ represents the maximum length
of any packet that traverses the link, and $C$ represents the link capacity. The term $l_{\max}/C$ accounts for
the fact that the packet transmission is not preemptive. However, since in general $C \gg r$, we ignore
this term here.

9.3.2 Excess Bandwidth Allocation

Our solution to provide guaranteed services is based on a non-work conserving scheduling
algorithm, i.e., CJVC. As a result, even if the network is completely idle, a guaranteed flow will receive
no more than its reserved rate. While this service is appropriate for many applications such as IP
telephony, other applications such as video streaming would prefer a more flexible service in which
they can opportunistically take advantage of the unused bandwidth to achieve better quality. In
the domain of stateful solutions, there are several algorithms, including variants of Weighted Fair
Queueing [10, 48, 79] and Fair Service Curve [106], that provide the ability to share the excess
(unused) bandwidth.

In this context, it would be interesting to develop stateless algorithms that are able to achieve
excess bandwidth sharing, while still providing guaranteed services. As discussed in Section 9.2, the
main problem is that it is very hard for DPS algorithms such as CJVC to adapt to very rapid changes
of excess bandwidth available at core routers. In the case of CJVC this is because the scheduling
parameters are computed when the packet arrives at the ingress router, a point at which it is very hard
if not impossible to accurately predict what the excess bandwidth will be when the packet arrives
at a particular core router. There are at least two general approaches to alleviate this problem: (1)
use a feedback mechanism similar to LIRA to inform ingress routers about the excess bandwidth
available at core routers, and (2) have core routers compute packet scheduling parameters based
on both the state carried by the packet headers and some internal state maintained by the router.
The main challenge of the first approach is to balance the freshness of the information maintained at
ingress routers regarding the excess bandwidth inside the network with the overhead of the feedback
mechanism. The main challenge of the second approach is to maintain the bandwidth and delay
guarantees without increasing the scheduling complexity. This is hard because the complexity of
algorithms such as CJVC depends directly on the buffer size (see Section 5.3.3), and the buffer size
at core routers will significantly increase as a result of allowing flows to use excess bandwidth [79].

9.3.3  Link  Sharing 

While most of the previous research directed at providing better services in packet switching
networks has focused on providing guaranteed services or protection for each individual flow, several
recent works [11, 39, 92] have argued that it is also important to support a hierarchical link-sharing
service.

In  hierarchical  link-sharing,  there  is  a  class  hierarchy  associated  with  each  link  that  specifies 
the  resource  allocation  policy  for  the  link.  A  class  represents  a  traffic  stream  or  some  aggregate 
of  traffic  streams  that  are  grouped  according  to  administrative  affiliation,  protocol,  traffic  type,  or 
other  criteria.  Figure  9.1  shows  an  example  class  hierarchy  for  a  45  Mbps  link  that  is  shared  by  two 
organizations,  Carnegie  Mellon  University  (CMU)  and  University  of  Pittsburgh  (U.  Pitt).  Below 
each  of  the  two  organization  classes,  there  are  classes  grouped  based  on  traffic  types.  Each  class  is 
associated  with  its  resource  requirements,  in  this  case,  a  bandwidth,  which  is  the  minimum  amount 
of  service  that  the  traffic  of  the  class  should  receive  when  there  is  enough  demand. 

Figure 9.1: An Example of Link-Sharing Hierarchy.

There are several important goals that the hierarchical link-sharing service aims to achieve.
First, each class should receive a certain minimum amount of resource if there is enough demand.
In the example, CMU's traffic should receive at least 25 Mbps of bandwidth during a period when
the aggregate traffic from CMU has a higher arrival rate. Second, at each level of the hierarchy,
active children should be able to use the excess bandwidth made available by the inactive children.
In the case when there is no audio or video traffic from CMU, the data traffic from CMU should be
able to use all the bandwidth allocated to CMU (25 Mbps). Finally, we should be able to provide
both bandwidth and delay guarantees to leaf classes, possibly by decoupling the bandwidth and
delay allocations. In the example, the CMU Distinguished Lecture video and audio classes are two
leaf classes that require both bandwidth and delay guarantees.
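
The second goal, redistribution of excess bandwidth among active siblings, can be sketched in Python as follows. This is a simplified illustration and not one of the cited link-sharing algorithms: it assumes that every active leaf is greedy (it can use any bandwidth offered) and that excess is shared in proportion to the reservations. Only the 45 Mbps link and the 25 Mbps CMU reservation come from the text; the 20 Mbps U. Pitt reservation, the leaf classes, and the leaf reservations are hypothetical, as are the names `LinkClass`, `distribute`, and `build_link`.

```python
from dataclasses import dataclass, field

@dataclass
class LinkClass:
    name: str
    reserved_mbps: float                       # minimum bandwidth of the class
    children: list = field(default_factory=list)
    has_demand: bool = False                   # only meaningful for leaf classes

    def is_active(self):
        if not self.children:
            return self.has_demand
        return any(child.is_active() for child in self.children)

def distribute(node, bandwidth_mbps, allocation):
    """Split `bandwidth_mbps` among the active leaves below `node`."""
    if not node.children:
        allocation[node.name] = bandwidth_mbps
        return
    active = [child for child in node.children if child.is_active()]
    total_reserved = sum(child.reserved_mbps for child in active)
    for child in active:
        share = bandwidth_mbps * child.reserved_mbps / total_reserved
        distribute(child, share, allocation)

def build_link(cmu_audio, cmu_video, cmu_data, pitt_data):
    cmu = LinkClass("CMU", 25, [
        LinkClass("CMU audio", 5, has_demand=cmu_audio),    # hypothetical reservation
        LinkClass("CMU video", 10, has_demand=cmu_video),   # hypothetical reservation
        LinkClass("CMU data", 10, has_demand=cmu_data),     # hypothetical reservation
    ])
    pitt = LinkClass("U. Pitt", 20, [                       # assumed: 45 - 25 Mbps
        LinkClass("U. Pitt data", 20, has_demand=pitt_data),
    ])
    return LinkClass("link", 45, [cmu, pitt])

# No audio or video traffic from CMU: CMU data may use all of CMU's bandwidth.
allocation = {}
distribute(build_link(False, False, True, True), 45, allocation)
```

Note that the computation needs the set of active classes at every level of the hierarchy, which is exactly the kind of state that depends on the behavior of competing traffic.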

In short, hierarchical link-sharing aims to provide (1) bandwidth and delay guarantees at leaf
classes, (2) bandwidth guarantees at interior classes, and (3) excess bandwidth distribution among
children classes. Due to the service complexity, it should come as no surprise that all current
solutions require routers to maintain per class state [11, 39, 92]. A natural research direction would be
to implement the link-sharing service in a SCORE network. Providing such a service is challenging
because we have to deal with both limitations discussed in Section 9.2: the first limitation makes it
difficult to provide excess bandwidth distribution; the second limitation makes it difficult to encode
the reservation and the position of each ancestor class the packet belongs to at each router along
its path. A possible solution would be to come up with “reasonable” restrictions that allow
efficient state encoding without significantly compromising the flexibility and the utilization offered by
existing stateful solutions.


9.3.4  Multicast 

The solutions presented in this dissertation were primarily designed for unicast traffic. An
interesting direction for future work would be to extend them to multicast. In the case of CSFQ this
is straightforward: since packet replication does not affect the flow rate along an individual link,
when a router replicates a packet, it just needs to copy the DPS state into the replica headers. In
contrast, extending our guaranteed service solution to support multicast would be more difficult.
Part of the challenge would be to come up with acceptable service semantics. For example, do we
want to delay all packets by the same amount no matter what paths they traverse, or do we want to
achieve the minimum delay on each individual path? Do we want to allocate the minimum
bandwidth that is available to all receivers, or do we want the ability to allocate different bandwidths
on different paths? One important observation that simplifies the problem is that in all traditional
multicast solutions, at least the branching routers in the multicast tree have to maintain per group
state. By leveraging this state, it would be possible to update the DPS state carried by the packets as
a function of the branch they follow. In theory, this will allow us to provide service differentiation
on a per path basis.

Another future direction in the context of multicast would be to use DPS to implement the
multicast service itself. A straightforward approach would be for the sender to insert the list of all
receivers' IP addresses in the packet headers. This would eliminate the need for routers to keep
any multicast state: when a packet arrives, a router inspects the list of the addresses in the packet
header and replicates the packet to each output port that corresponds to an address in the list. While
simple, this solution has a fundamental limitation: the state carried by the packets increases with the
number of receivers. In a group with hundreds of receivers, we can easily reach the situation when
we have to transmit more DPS state than data information! Note that this problem is an instance
of the second limitation discussed in Section 9.2, i.e., per group (flow) routing state maintained by
traditional multicast schemes is unique for each router. A possible solution would be to find a more
efficient encoding of the forwarding state, and, possibly, partition it between the packet headers
and core routers.
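
A back-of-the-envelope calculation shows how quickly the receiver list outgrows the data. The sketch below assumes IPv4 addresses of 4 bytes each and a hypothetical 1000-byte payload; both the payload size and the function name `receiver_list_bytes` are illustrative assumptions.

```python
IPV4_ADDRESS_BYTES = 4   # size of one IPv4 address

def receiver_list_bytes(num_receivers):
    """Header space needed to list every receiver's IPv4 address."""
    return num_receivers * IPV4_ADDRESS_BYTES

payload_bytes = 1000     # hypothetical data payload per packet

for receivers in (10, 100, 300):
    state_bytes = receiver_list_bytes(receivers)
    # Print the group size, the state carried, and whether it exceeds the payload.
    print(receivers, state_bytes, state_bytes > payload_bytes)
```

Under these assumptions the per packet state exceeds the payload once the group has more than 250 receivers.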

9.3.5 Verifiable End-to-End Protocols

As described in Chapter 7, the “verify-and-protect” approach can overcome some of the robustness
and scalability limitations of the SCORE/DPS framework. We believe, however, that this is a far
more general and powerful approach that can be used to design new network services and protocols.
Usually, whenever we implement a network service without support from the network, we make the
(implicit) assumption that users cooperate. For example, the recently proposed endpoint admission
control algorithms assume that (1) each user probes the network to detect the level of congestion,
and then (2) it sends traffic only if the level of congestion is sufficiently low [14]. Unfortunately, in
an economic environment like today's Internet, there is no strong incentive for users to cooperate.
For example, a user may choose to send traffic, even if the congestion level is high, in the hope that
it will force other users to give up and release their bandwidth. A natural way to create the incentive
for users to cooperate is to punish them if they don't. However, this requires the ability to identify
a malicious user, i.e., the ability to verify its behavior. We believe that verifiability should be a key
property of any end-to-end protocol, and not an afterthought as it happens today. In this context, we
believe that designing protocols and algorithms with verifiable behaviors is a very important topic
for future work.

9.3.6  Incremental  Deployability 

SCORE/DPS solutions described in this dissertation require changes to all routers within a network
domain. This is a serious limitation that may delay or even preclude the deployment of SCORE/DPS
solutions in the Internet. An important direction for future work is to alleviate or remove, if possible,
this limitation.

One approach would be to develop solutions that require only a subset of routers to be changed.
One example would be to study what levels of bandwidth and delay guarantees can be provided by
a network in which only the edge nodes are changed. The key difficulty would be to coordinate
the actions of these routers. Furthermore, this coordination would need to happen at a very small
time scale, because of the rapid changes in traffic characteristics. DPS represents an ideal starting
point for the development of such mechanisms, as it allows the exchange of traffic information at the
smallest possible granularity: on a per packet basis.

Another approach would be to build an overlay network consisting of high performance sites,
called Points-of-Presence (PoPs), connected by high quality virtual links. An example would be
to provide per flow bandwidth allocation by using the Premium service to guarantee capacity along
each virtual link, and a CJVC-like algorithm to manage this capacity among the competing flows.
Another example would be to monitor the available bandwidth along each virtual link, and use
this information together with a CSFQ-like algorithm to provide fair bandwidth allocation across
the overlay network. One possible approach to improve the quality of the virtual links would be
to construct the overlay such that each virtual link traverses no more than one ISP. There are two
reasons for this restriction. First, since the traffic between any two neighbor PoPs traverses only one
ISP, and since an ISP generally has full control of its resources, it would be much easier to provide
a service with strong semantics. Second, it would be much easier to verify, and therefore to enforce, the
service agreement between the neighbor PoPs and the ISP that handles their traffic. If one of the
two PoPs detects that the service agreement is broken, then it can conclude that this is because the
ISP that carries the traffic does not honor its agreement. In contrast, had the traffic between the two
PoPs traversed more than one ISP, it would have been very hard to identify which ISP was to blame.

9.3.7  General  Framework 

In this dissertation we have demonstrated by examples that it is possible to provide network
services with per flow semantics in a stateless network architecture. A very interesting theoretical
question is: what is the class of algorithms and services that can be emulated or approximated in
the SCORE/DPS framework? In Section 9.2 we informally discuss two of the limitations of the
current solutions. The next step would be to develop a theoretical framework that precisely formulates
the limitations and answers the previous question. Such a framework would provide us with a much
better understanding of what we can and what we cannot do in the SCORE/DPS framework. A
related question of practical interest is: can we come up with a general methodology that allows us
to transform a stateful network into a stateless network while preserving its functionality?


9.4  Final  Remarks 

In this dissertation, we have presented the first solution that uses a stateless network to provide
services as powerful and as flexible as the ones implemented by a stateful network. To illustrate the
power and the generality of our solution, we have implemented three of the most important services
proposed in the context of today’s Internet: guaranteed services, differentiated services,
and flow protection. While it is hard to predict the exact course of research in this area, we believe
that the door has been opened to many new and challenging problems of great practical importance
and theoretical interest.


Appendix A


Performance  Bounds  for  CSFQ 


In this appendix we give the proof of Theorem 1 (Section 4.3.4). Recall that for simplicity we make
the following assumptions: (1) the fair share $\alpha$ is fixed, (2) there is no buffering and therefore the
drop probability is given by Eq. (4.2), and (3) when a packet arrives, a fraction of that packet equal
to the flow’s forwarding probability is transmitted. The proof is based on two intermediate results
given in Lemmas 1 and 2, respectively. Lemma 1 gives the upper bound for the excess service
received by a flow with weight $w$ during an arbitrary time period in which the estimated rate of the
flow, as given by Eq. (4.3), does not exceed the flow’s fair rate, i.e., $w\alpha$. Similarly, Lemma 2 gives
the upper bound for the excess service received by a flow during an arbitrary period in which the
flow’s estimated rate is never smaller than its fair rate.
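
As a reading aid, the rate estimator and the fluid forwarding model assumed in the proofs can be sketched in Python. This is a minimal sketch, not the implementation used in this dissertation: the function names `update_rate` and `forwarded_bits` and all numeric values are illustrative assumptions. The update rule is the exponential averaging form that appears as Eq. (A.4) in the proof of Lemma 1, and the forwarded fraction is the ratio between the fair rate and the estimated rate, as used in the proof of Lemma 2.

```python
import math

def update_rate(old_rate, length, gap, K):
    """Exponential averaging rate estimate, in the form of Eq. (A.4).

    old_rate -- previous estimate r_{i-1}
    length   -- length l_i of the arriving packet
    gap      -- inter-arrival time T_i = t_i - t_{i-1} (must be > 0)
    K        -- averaging constant
    """
    decay = math.exp(-gap / K)
    return (1.0 - decay) * length / gap + decay * old_rate

def forwarded_bits(length, est_rate, fair_rate):
    """Fluid model of assumption (3): forward a fraction min(1, fair_rate / est_rate)."""
    if est_rate <= fair_rate:
        return length
    return length * fair_rate / est_rate

# A flow sending 1000-bit packets every 10 ms (100 Kbps), with K = 100 ms.
rate = 0.0
for _ in range(1000):
    rate = update_rate(rate, 1000, 0.01, 0.1)

# With a fair rate of 50 Kbps, half of each packet is forwarded in the fluid model.
sent = forwarded_bits(1000, 100_000.0, 50_000.0)
```

For a constant rate sender the estimate converges to the sending rate, since that rate is the fixed point of the update rule.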

First, we give two well known inequalities that are subsequently used in the proofs:

$$\frac{x}{1 - e^{-x}} > 1, \quad x > 0, \tag{A.1}$$

$$\frac{x e^{-x}}{1 - e^{-x}} < 1, \quad x > 0. \tag{A.2}$$

Lemma 1 Consider a link with a normalized fair rate $\alpha$, and a flow with weight $w$. The excess
service received by the flow during any interval $I = [t', t'')$, when its estimated rate $r$ does not
exceed its fair rate $r_\alpha = w\alpha$, i.e., $r(t) \le r_\alpha$, $\forall t \in I$, is bounded above by

$$r_\alpha K + l_{\max}, \tag{A.3}$$

where $l_{\max}$ represents the maximum length of a packet.

Proof. Without loss of generality assume that exactly $n$ packets of the flow are received during the
interval $I$. Let $t_i \in I$ ($1 \le i \le n$) be the arrival time of the $i$-th packet, and let $l_i$ denote its length.
According to Eq. (4.3), we have

$$r_i = \left(1 - e^{-T_i/K}\right) \frac{l_i}{T_i} + e^{-T_i/K} r_{i-1}, \quad 1 \le i \le n, \tag{A.4}$$

where $T_i = t_i - t_{i-1}$, and $r_0$ represents the initial estimated rate. If $r_0 > 0$, $t_0$ is assumed to be the
time when the last packet was received before $t'$. Otherwise, if $r_0 = 0$ then we take $t_0 = -\infty$.

Since by hypothesis $r_i \le r_\alpha$ ($1 \le i \le n$) it follows that all packets are forwarded and therefore
the total number of bits sent during the interval $I$ is $\sum_{i=1}^{n} l_i$. Thus, our problem can be reformulated
to be

$$\max \sum_{i=1}^{n} l_i \tag{A.5}$$

subject to

$$r_i \le r_\alpha, \quad 1 \le i \le n. \tag{A.6}$$

From Eq. (A.4) it follows that

$$l_i = \frac{r_i - r_{i-1} e^{-T_i/K}}{1 - e^{-T_i/K}} T_i, \quad 2 \le i \le n. \tag{A.7}$$

The above equation does not apply for $i = 1$, since it is not well defined for the case in which
$r_0 = 0$. Recall that in this case we take $t_0 = -\infty$, and therefore $T_1 = \infty$. Further, define

$$F(r_1, r_2, \ldots, r_n) = \sum_{i=2}^{n} l_i. \tag{A.8}$$

Our goal then is to maximize $F(r_1, r_2, \ldots, r_n)$. By plugging $l_i$ from Eq. (A.7) into $F(r_1, r_2, \ldots, r_n)$
and taking the derivative with respect to $r_i$ ($2 \le i < n$) we obtain

$$\frac{\partial F(r_1, r_2, \ldots, r_n)}{\partial r_i} = \frac{T_i}{1 - e^{-T_i/K}} - \frac{T_{i+1} e^{-T_{i+1}/K}}{1 - e^{-T_{i+1}/K}}, \quad 2 \le i < n, \tag{A.9}$$

and

$$\frac{\partial F(r_1, r_2, \ldots, r_n)}{\partial r_n} = \frac{T_n}{1 - e^{-T_n/K}}. \tag{A.10}$$

By using Eqs. (A.1) and (A.2) (and after making substitutions $T_i \to x_1 K$, $T_{i+1} \to x_2 K$, and
$T_n \to x_3 K$) we have

$$\frac{\partial F(r_1, r_2, \ldots, r_n)}{\partial r_i} > 0, \quad 2 \le i \le n. \tag{A.11}$$

Thus, $F(r_1, r_2, \ldots, r_n)$ is maximized when $r_2, r_3, \ldots, r_n$ achieve their maximum value, which in
our case is $r_\alpha$. Consequently, we have

$$
\begin{aligned}
F(r_1, r_2, \ldots, r_n) &= \sum_{i=2}^{n} \frac{r_i - r_{i-1} e^{-T_i/K}}{1 - e^{-T_i/K}} T_i \\
&\le \frac{r_\alpha - r_1 e^{-T_2/K}}{1 - e^{-T_2/K}} T_2 + r_\alpha \sum_{i=3}^{n} T_i \\
&= r_\alpha T_2 + (r_\alpha - r_1) \frac{e^{-T_2/K}}{1 - e^{-T_2/K}} T_2 + r_\alpha \sum_{i=3}^{n} T_i \\
&< K r_\alpha + r_\alpha \sum_{i=2}^{n} T_i \\
&\le K r_\alpha + r_\alpha (t'' - t'),
\end{aligned}
\tag{A.12}
$$

where the next to last inequality follows from the fact that $r_1 \ge 0$, $r_2 \le r_\alpha$, and by using Eq. (A.2)
after substitution $T_2 \to xK$.

By using Eq. (A.12), and the fact that $l_1 \le l_{\max}$, we get

$$\sum_{i=1}^{n} l_i = l_1 + F(r_1, r_2, \ldots, r_n) < l_{\max} + K r_\alpha + r_\alpha (t'' - t'). \tag{A.13}$$

Since $(t'' - t') r_\alpha$ represents exactly the service to which the flow is entitled during the interval $I$, it
follows that the excess service is bounded by $l_{\max} + r_\alpha K$. □
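
The bound of Lemma 1 can also be checked numerically. The following Python sketch is an illustration under stated assumptions, not part of the proof: it simulates a greedy sender that, at equally spaced arrival times, always sends the largest packet that keeps the estimate of Eq. (A.4) at or below the fair rate. The function name `lemma1_check` and the numeric parameters (a 1 Mbps fair rate, K = 100 ms, 12000-bit maximum packets, 1 ms spacing) are hypothetical.

```python
import math

def lemma1_check(r_alpha, K, l_max, gap, n):
    """Return (excess service, Lemma 1 bound) for a greedy sender of n packets.

    Packets arrive `gap` seconds apart. Each packet is as large as possible
    while the estimated rate stays at or below r_alpha and the length stays
    at or below l_max.
    """
    decay = math.exp(-gap / K)
    # First packet: with r_0 = 0 and T_1 = infinity the estimate stays 0,
    # so the first packet may be as large as l_max.
    rate = 0.0
    total_bits = l_max
    for _ in range(n - 1):
        # Largest l_i that yields r_i = r_alpha (Eq. (A.7)), capped at l_max.
        length = min(l_max, gap * (r_alpha - rate * decay) / (1.0 - decay))
        rate = (1.0 - decay) * length / gap + decay * rate   # Eq. (A.4)
        total_bits += length
    # Shortest interval containing all n arrivals, which makes the check tightest.
    entitled = r_alpha * gap * (n - 1)
    return total_bits - entitled, r_alpha * K + l_max

excess, bound = lemma1_check(r_alpha=1e6, K=0.1, l_max=12000, gap=0.001, n=1000)
print(excess, bound)
```

In this scenario the sender bursts maximum-size packets until its estimate reaches the fair rate and then sends at exactly the fair rate, so the excess service stays positive but below the bound.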


Lemma 2 Consider a link with a normalized fair rate $\alpha$, and a flow with weight $w$ that sends at a
rate no larger than $R$, where $R > r_\alpha$. Next consider an interval $I = [t', t'')$ such that $t'$ represents
the time just after a packet has arrived, and during which the rate estimator $r$ is never smaller than
$r_\alpha = w\alpha$, i.e., $r(t) \ge r_\alpha$, $\forall t \in I$. Then, the excess service received by the flow during $I$ is bounded
above by

$$r_\alpha K \ln\frac{R}{r_\alpha}. \tag{A.14}$$

Proof. Again, assume that during the interval $I$ the flow sends exactly $n$ packets. Similarly, let $t_i$ be the arrival time of the $i$-th packet, and let $l_i$ denote its length. Since we assume that when packet $i$ arrives, a fraction of that packet equal to the flow's forwarding probability, i.e., $r_\alpha/r_i$, is transmitted, the problem reduces to finding an upper bound for

$$\sum_{i=1}^{n} l_i \frac{r_\alpha}{r_i}, \tag{A.15}$$

where $r_\alpha \le r_i \le R$, $1 \le i \le n$.
From Eq. (A.4), we obtain

$$\frac{l_i}{r_i} = \frac{T_i}{1 - e^{-T_i/K}} \left(1 - \frac{r_{i-1}}{r_i}\, e^{-T_i/K}\right). \tag{A.16}$$


Note that unlike Eq. (A.7), the above equation also applies for $i = 1$. This is because we are guaranteed that there is at least one packet received before $t'$ and therefore $T_1$ is well defined, i.e., from the hypothesis we have $T_1 = t_1 - t_0 = t_1 - t'$.
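As a quick numerical illustration of Eq. (A.16), the sketch below assumes that the exponential-averaging estimator of Eq. (A.4) has the form $r_i = (1 - e^{-T_i/K})\, l_i/T_i + e^{-T_i/K}\, r_{i-1}$. This form is inferred from Eq. (A.16) rather than quoted, and the function names and sample values are illustrative only.

```python
import math

def update_rate(r_prev, length, T, K):
    """One step of the assumed exponential-averaging estimator (form of Eq. (A.4))."""
    w = math.exp(-T / K)
    return (1.0 - w) * length / T + w * r_prev

def length_over_rate(r_prev, r_new, T, K):
    """Right-hand side of Eq. (A.16): l_i / r_i from r_{i-1}, r_i and T_i."""
    w = math.exp(-T / K)
    return T / (1.0 - w) * (1.0 - (r_prev / r_new) * w)

# Hypothetical sample values: rates in bits/s, length in bits, times in seconds.
r_prev, length, T, K = 2.0e5, 1.2e4, 0.05, 0.1
r_new = update_rate(r_prev, length, T, K)
print(length / r_new, length_over_rate(r_prev, r_new, T, K))
```

Both printed values agree up to rounding, which is what solving the assumed estimator for $l_i/r_i$ predicts.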

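Further, by making substitution $x \to T_i/K$ in Eq. (A.2) we have

$$\frac{T_i\, e^{-T_i/K}}{1 - e^{-T_i/K}} \le K, \quad T_i \ge 0. \tag{A.17}$$

From the above inequality and Eq. (A.16) we obtain

$$\frac{l_i}{r_i} \le T_i + \left(1 - \frac{r_{i-1}}{r_i}\right) K, \quad 1 \le i \le n, \tag{A.18}$$

and further

$$\sum_{i=1}^{n} \frac{l_i}{r_i} \le \sum_{i=1}^{n} T_i + K \left(n - \sum_{i=1}^{n} \frac{r_{i-1}}{r_i}\right). \tag{A.19}$$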


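Next, since the arithmetic mean is no smaller than the geometric mean, i.e., $(\sum_{i=1}^{n} x_i)/n \ge (\prod_{i=1}^{n} x_i)^{1/n}$, where $x_i \ge 0$ ($1 \le i \le n$), we have

$$n - \sum_{i=1}^{n} \frac{r_{i-1}}{r_i} \le n - n \left(\prod_{i=1}^{n} \frac{r_{i-1}}{r_i}\right)^{1/n} = n - n \left(\frac{r_0}{r_n}\right)^{1/n} \le n - n \left(\frac{r_\alpha}{R}\right)^{1/n}, \tag{A.20}$$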


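where the last inequality follows from the hypothesis, i.e., $r_\alpha \le r_i \le R$ ($0 \le i \le n$).

By replacing $x \to (R/r_\alpha)^{1/n}$ in the well known inequality $\ln(x) \ge 1 - 1/x$ ($x \ge 1$), we obtain $n(1 - (r_\alpha/R)^{1/n}) \le \ln(R/r_\alpha)$, $n \ge 1$. Thus, Eq. (A.20) becomes

$$n - \sum_{i=1}^{n} \frac{r_{i-1}}{r_i} \le \ln\frac{R}{r_\alpha}. \tag{A.21}$$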


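Finally, from Eqs. (A.19) and (A.21) we obtain

$$\sum_{i=1}^{n} l_i \frac{r_\alpha}{r_i} \le r_\alpha \sum_{i=1}^{n} T_i + r_\alpha K \ln\frac{R}{r_\alpha} \le r_\alpha (t'' - t') + r_\alpha K \ln\frac{R}{r_\alpha}. \tag{A.22}$$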


Since $(t'' - t')\, r_\alpha$ represents exactly the number of bits that the flow is entitled to send during the interval $I$, the proof follows. □


Theorem 1 Consider a link with a normalized fair rate $\alpha$, and a flow with weight $w$. Then, the excess service received by a flow with weight $w$ that sends at a rate no larger than $R$ is bounded above by

$$r_\alpha K \left(1 + \ln\frac{R}{r_\alpha}\right) + l_{max}, \tag{A.23}$$

where $r_\alpha = \alpha w$, and $l_{max}$ represents the maximum length of a packet.


Proof. Assume the flow becomes active for the first time at $t_a$. Let $t_b$ be the time when its rate estimator exceeds $r_\alpha$ for the first time, i.e., $r(t_b) > r_\alpha$ and $r(t) \le r_\alpha$, $\forall t < t_b$. If such a time $t_b$ does not exist, according to Lemma 1, the excess service received by the flow is bounded by $r_\alpha K + l_{max}$, which concludes the proof for this case. In the following paragraphs, we consider the case when $t_b$ exists.

Next, we show that the service received by the flow is maximized when $r(t) \ge r_\alpha$, $\forall t \ge t_b$. The proof is by contradiction. Assume there is an interval $I' = [t', t'') \subset I$, such that $t' > t_b$ and that $r(t) < r_\alpha$ ($t' \le t < t''$). Then using an identical argument as in Lemma 1, it can be shown that the service that the flow receives during $I'$ increases when $r(t) = r_\alpha$, $\forall t \in I'$. The only change in the proof of Lemma 1 is that now Eq. (A.7) will also apply for $i = 1$, as according to the hypothesis the estimated rate just before $t'$ (i.e., $r_0$ in Lemma 1) is greater than zero; more precisely $r_0 \ge r_\alpha$. Further, by including $l_1$ in the definition of $F(\cdot)$ (see Eq. (A.8)) we show that $F(\cdot)$ is increasing in each of its arguments $r_i$, $1 \le i \le n$.

Thus, the service received by the flow is maximized when the estimated rate of the flow is no smaller than $r_\alpha$ after time $t_b$. But then, according to Lemma 2, the excess service received by the flow after $t_b$ is bounded by¹

$$r_\alpha K \ln\frac{R}{r_\alpha}. \tag{A.24}$$

Similarly, from Lemma 1 it follows that the excess service received by the flow during the interval $[t_a, t_b)$ is bounded above by

$$r_\alpha K + l_{max}, \tag{A.25}$$

and therefore by combining (A.24) and (A.25) the total excess service is bounded above by

$$r_\alpha K \left(1 + \ln\frac{R}{r_\alpha}\right) + l_{max}. \tag{A.26}$$

□
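To make the magnitude of the bound in Theorem 1 concrete, here is a minimal sketch that evaluates $r_\alpha K (1 + \ln(R/r_\alpha)) + l_{max}$. The function name, the units (bits and seconds), and the sample values are assumptions chosen for illustration; they do not come from the text.

```python
import math

def excess_service_bound(alpha, w, R, K, l_max):
    """Bound of Theorem 1 on the excess service of a flow.

    alpha : normalized fair rate of the link
    w     : weight of the flow, so that r_alpha = alpha * w
    R     : maximum sending rate of the flow
    K     : averaging constant of the rate estimator
    l_max : maximum packet length
    """
    r_alpha = alpha * w
    if R < r_alpha:
        # Illustrative guard: the logarithmic term is stated for R >= r_alpha.
        raise ValueError("expected R >= r_alpha")
    return r_alpha * K * (1.0 + math.log(R / r_alpha)) + l_max

# Hypothetical example: fair rate 1 Mbps, K = 100 ms, 1500-byte packets,
# and a flow that sends e times faster than its fair rate.
print(excess_service_bound(1e6, 1.0, math.e * 1e6, 0.1, 12000.0))
```

Because the sending rate enters only through $\ln(R/r_\alpha)$, doubling $R$ adds just $r_\alpha K \ln 2$ to the bound.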


¹ Without loss of generality, here we assume that $t_b$ represents the time just after $r$ was evaluated as being smaller than $r_\alpha$ for the last time. Since this coincides with a packet arrival, Lemma 2 applies.
Appendix B


Performance  Bounds  for  Guaranteed  Services 


B.1 Network Utilization of Premium Service in Diffserv Networks

Premium  service  provides  the  equivalent  of  a  dedicated  link  of  fixed  bandwidth  between  edge  nodes 
in  a  Diffserv  network.  In  such  a  service,  each  premium  flow  has  a  reserved  peak  rate.  In  the  data 
plane,  ingress  nodes  police  each  premium  service  traffic  flow  according  to  its  peak  reservation  rate. 
Inside  the  Diffserv  domain,  core  routers  put  the  aggregate  of  all  premium  traffic  into  one  scheduling 
queue  and  service  the  premium  traffic  with  strict  priority  over  best  effort  traffic.  In  the  control 
plane,  a  bandwidth  broker  is  used  to  perform  admission  control.  The  idea  is  that  by  using  very 
conservative  admission  control  algorithms  based  on  worst  case  analysis,  together  with  peak  rate 
policing  at  ingress  nodes  and  static  priority  scheduling  at  core  nodes,  it  is  possible  to  ensure  that  all 
premium  service  packets  incur  very  small  queueing  delay. 

One important design question is how conservative the admission control algorithm needs to be. In other words, what is the upper limit on the utilization of the network capacity that can
be  allocated  to  premium  traffic  if  we  want  the  premium  service  to  achieve  the  same  level  of  service 
assurance  as  the  guaranteed  service,  such  that  the  queueing  delay  of  all  premium  service  packets  is 
bounded  by  a  fixed  number  even  in  the  worst  case? 

For  the  purpose  of  this  discussion,  we  use  flow  to  refer  to  a  subset  of  packets  that  traverse  the 
same  path  inside  a  Diffserv  domain  between  two  edge  nodes.  Thus,  with  the  highest  level  of  traffic 
aggregation,  a  flow  consists  of  all  packets  between  the  same  pair  of  ingress  and  egress  nodes.  Note 
that  even  in  this  case,  the  number  of  flows  in  a  network  can  be  quite  large  as  this  number  may 
increase  quadratically  with  the  number  of  edge  nodes. 

Let us consider a domain consisting of 4 × 4 routers with links of capacity $C$. Assume that the fraction of the link capacity allocated to the premium traffic is limited to $\gamma$. Assume also that all flows have equal packet sizes, and that each ingress node shapes not only each flow, but also the aggregate traffic at each of its outputs. Figure B.1(a) shows the traffic pattern at the first core router along a path. Each input receives 12 identical flows, where each flow has a reservation of $\gamma C/12 = C/48$ (i.e., $\gamma = 25\%$). Let $\tau$ be the transmission time of one packet; then, as shown in the figure, the inter-arrival time between two consecutive packets in each flow is $48\tau$, and the inter-arrival time between two consecutive packets in the aggregate flow is $4\tau$.

Assume the first three flows at each input are forwarded to output 1. This will cause a burst of 12 packets to arrive at output 1 in an $8\tau$ long interval and the last packet of the burst to incur an additional delay of $3\tau$. Now assume that the next router receives at each input a traffic pattern similar to the one generated by output 1 of the first core router, as shown in Figure B.1(b). In addition, assume that the last three flows from each input burst are forwarded to output 1. This will cause a burst of 12 packets to arrive at output 1 in a $2\tau$ long interval and the last packet in the burst to incur an additional delay of $9\tau$. Thus, after two hops, a packet is delayed by as much as $12\tau$. This pattern can be repeated for all subsequent hops.

Figure B.1: Per-hop worst-case delay experienced by premium traffic in a Diffserv domain. (a) and (b) show the traffic pattern at the first and a subsequent node. The black and all dark grey packets go to the first output; the light grey packets go to the other outputs.
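The two per-hop delays in the 4 × 4 example can be reproduced with a small FIFO model. The sketch below is illustrative only: it assumes one packet is transmitted per time unit $\tau$, that at the first hop each of the four inputs delivers its three packets at times $0$, $4\tau$, $8\tau$, and that at the subsequent hop each input delivers three back-to-back packets at times $0$, $\tau$, $2\tau$. The function name is hypothetical.

```python
def last_packet_wait(arrival_times):
    """Queueing delay, in units of tau, of the last packet of a burst at a
    FIFO output link that transmits exactly one packet per time unit."""
    free_at = 0  # first instant at which the link can start a new packet
    wait = 0
    for t in sorted(arrival_times):
        start = max(t, free_at)
        wait = start - t
        free_at = start + 1
    return wait

# Four inputs, three packets each, all destined to output 1.
first_hop = [t for t in (0, 4, 8) for _ in range(4)]
later_hop = [t for t in (0, 1, 2) for _ in range(4)]
print(last_packet_wait(first_hop), last_packet_wait(later_hop))  # 3 9
```

The printed values match the additional delays of $3\tau$ and $9\tau$ quoted for the first and second hop.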

In general, consider a $k \times k$ router, and let $n$ be the number of flows that traverse each link. For simplicity, assume that $\gamma \ge 1/k$. Then it can be shown that the worst case delay experienced by a packet after $h$ hops is

$$D = \left(n - 1 - \left(\frac{n}{k} - 1\right)\frac{1}{\gamma}\right)\tau + (h - 1)\, n\, \frac{k - 1}{k}\, \tau + h\tau, \tag{B.1}$$

where the first term is the additional delay at the first hop, the second term is the additional delay at all subsequent hops, and the last term accounts for the packet transmission time at each hop. As a numerical example, let $C = 1$ Gbps, a packet size of 1500 bytes, $k = 16$, $\gamma = 10\%$, $n = 1500$ and $h = 15$. From here we obtain $\tau = 12$ μsec, and a delay $D$ of over 240 ms. Finally, if $\gamma < 1/k$ it can be shown that it will take only $\lceil \log_k(1/\gamma) \rceil$ hops to achieve a continuous burst. For example, for $\gamma = 1\%$ and $k = 16$, it takes only two hops to obtain a continuous burst.
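The numerical example can be checked with a short script. This is a sketch that assumes Eq. (B.1) reads $D = (n - 1 - (n/k - 1)/\gamma)\tau + (h - 1)\,n\,\frac{k-1}{k}\,\tau + h\tau$, a form chosen because it reproduces both the $3\tau$ and $9\tau$ per-hop delays of the 4 × 4 example and the 240 ms figure. The function names are hypothetical.

```python
import math

def worst_case_delay(n, k, gamma, h, tau):
    """Worst-case delay after h hops, following the assumed form of Eq. (B.1)."""
    if gamma < 1.0 / k:
        raise ValueError("this form is stated for gamma >= 1/k")
    first_hop = (n - 1 - (n / k - 1) / gamma) * tau
    later_hops = (h - 1) * n * (k - 1) / k * tau
    transmission = h * tau
    return first_hop + later_hops + transmission

def hops_to_continuous_burst(k, gamma):
    """Number of hops ceil(log_k(1/gamma)) quoted for the case gamma < 1/k."""
    return math.ceil(math.log(1.0 / gamma, k))

tau = 1500 * 8 / 1e9                          # 1500-byte packet at 1 Gbps
D = worst_case_delay(1500, 16, 0.10, 15, tau)
print(f"tau = {tau * 1e6:.0f} usec, D = {D * 1e3:.1f} ms")
print(hops_to_continuous_burst(16, 0.01))
```

With these inputs the three terms contribute roughly 571.5, 19687.5 and 15 packet times, so the delay is dominated by the hops after the first one.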

The above example demonstrates that low network utilization and traffic shaping at ingress nodes alone are not enough to guarantee a “small” worst-case delay for all the premium traffic. This result is not surprising. Even using a per-flow scheduler like Weighted Fair Queueing (WFQ) will not help to reduce the worst case end-to-end delay for all packets. In fact, if all flows in the above example are given the same weight, the worst case delay under WFQ is $hn\tau$, which is basically the same as the one given by Eq. (B.1). However, the major advantage of using WFQ is that it allows us to differentiate among flows, which is a critical property as long as we cannot guarantee a “small” delay to all flows. In addition, WFQ can achieve 100% utilization.
B.2 Proof of Theorem 2


In this appendix we show that a network of CJVC servers provides the same end-to-end delay guarantees as a network of Jitter-VC servers. In particular, in Theorem 2 we show that the deadline of a packet at the last hop in both systems is the same. This result is based on Lemmas 4 and 5, which give the expressions of the deadline of a packet at the last hop in a network of Jitter-VC servers and in a network of CJVC servers, respectively. First, we present a preliminary result used in proving Lemma 4.


Lemma 3 Consider a network of Jitter-VC servers. Let $\pi_j$ denote the propagation delay between hops $j$ and $j + 1$, and let $\tau_j$ be the maximum transmission time of a packet at node $j$. Then for any $j > 1$ and $i, k \ge 1$ we have

$$d_{i,j+1}^k - d_{i,j}^k - \tau_j - \pi_j \ge d_{i,j}^k - d_{i,j-1}^k - \tau_{j-1} - \pi_{j-1}. \tag{B.2}$$


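Proof. The proof is by induction on $k$. First, recall that by definition $g_{i,j}^k = d_{i,j}^k + \tau_j - s_{i,j}^k$ (see Table 5.1), and that for $j > 1$, $a_{i,j}^k = s_{i,j-1}^k + \pi_{j-1}$. From here and from Eqs. (5.1) and (5.2) we have then

$$d_{i,j}^k = \max(a_{i,j}^k + g_{i,j-1}^k,\; d_{i,j}^{k-1}) + \frac{l_i^k}{r_i} = \max(d_{i,j-1}^k + \tau_{j-1} + \pi_{j-1},\; d_{i,j}^{k-1}) + \frac{l_i^k}{r_i}. \tag{B.3}$$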


Basic Step. For $k = 1$ and any $j > 1$, from Eq. (B.3) we have trivially $d_{i,j}^1 = d_{i,j-1}^1 + \tau_{j-1} + \pi_{j-1} + l_i^1/r_i$, $\forall j > 1$, and therefore $d_{i,j}^1 - d_{i,j-1}^1 - \tau_{j-1} - \pi_{j-1} = l_i^1/r_i$, $\forall j > 1$.

Induction  Step.  Assume  Eq.  (B.2)  is  true  for  k.  Then  we  need  to  show  that 


max(4+4  +  T,_i  +  7rj_i,4^^)  -  max(dHl2  +  +  7rj-2,dj'j_i) 

where  the  second  inequality  follows  after  using  Eq.  (B.3).  Next  consider  two  case 
+  TTj-l  <  or  not.  Assume  +  fj-i  +  ttj-i  <  dy-  From  Eq. 


F  TTj-i  <  dy  or  not.  Assume  dyl 
tion  hypothesis  we  obtain 

-Tj-T^j  =  max(dfj 
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jk 


Next,  assume  that 


+  TTj-i  >  d\y 

From  here  and  by  using  Eq.  (B.3)  and  Eq.  (B.4)  we  have 

,  =  max  ( '  +T.J+  TTj .  , )  - 


r, 


(B.6) 


(B.7) 


max(r/-'t_' ,  +  Tj_i  +  7rj_],r/fj)  -  Tj  -  tTj 
max(4+'  +  Tj  +  7rj,4j+i)  “  4j-i  “ 

4j '  -  +  '0-1  + 


Tj 

4j-i  -  inax(4+-^2  +  Oy-2  +  7rj-2.4,/-i) 
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(Eq.  (B.3)) 
(Eq.  (B.3)) 

(Eq.  (B.6)) 


4. 


4’T-i  -^j-i  -  0 


j-i- 


This  completes  the  proof.  □ 

Lemma 4 The deadline of any packet $p_i^k$, $k > 1$, at the last hop $h$ in a network of Jitter-VC servers is

$$d_{i,h}^k = \max\left(e_{i,1}^k + h\,\frac{l_i^k}{r_i} + \sum_{m=1}^{h-1} (\tau_m + \pi_m),\;\; d_{i,h}^{k-1} + \frac{l_i^k}{r_i}\right). \tag{B.8}$$

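Proof. Let $j^* > 1$ be the last hop for which $d_{i,j^*-1}^k + \tau_{j^*-1} + \pi_{j^*-1} < d_{i,j^*}^{k-1}$. We consider two cases whether $j^*$ exists or not.

Case 1. ($j^*$ does not exist) From Eq. (5.1) we have $e_{i,j}^k = d_{i,j-1}^k + \tau_{j-1} + \pi_{j-1}$, $\forall j > 1$. From here and by using Eq. (5.2) we obtain

$$d_{i,h}^k = e_{i,1}^k + h\,\frac{l_i^k}{r_i} + \sum_{m=1}^{h-1} (\tau_m + \pi_m). \tag{B.9}$$

Because we assume that $j^*$ does not exist we also have $d_{i,h}^k \ge d_{i,h}^{k-1} + l_i^k/r_i$, which concludes the proof of this case.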

Case  2.  (j*  exists)  In  this  case  we  show  that  j*  =  h.  Assume  this  is  not  true.  Then  we  have 
e’lj  =  d\j_^  +  Tj-\  +  TTj-i,  Vj  >  j*.  By  using  Eq.  (5.2)  we  obtain 

ik 

4,/j  =  4j‘  +  — i*  + 1)  —  +  Y,  (B.io) 

Ti 

m=j* 




On  the  other  hand,  by  the  definition  of  j*  and  from  Eqs.  (5.1)  and  (5.2)  we  have 


0 

dlj.  =  max(4j*-i  +  Tj*-i  +  TTj.-i,  4~.^)  + 

^  i 
ik 

>  A,j--\  +  '0'*-i  +  ’’■j*-!  + 


(B.ll) 


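As a result we obtain $d_{i,j^*}^k - d_{i,j^*-1}^k - \tau_{j^*-1} - \pi_{j^*-1} > l_i^k/r_i$. By iteratively applying Lemma 3 we have

$$d_{i,m+1}^k - d_{i,m}^k - \tau_m - \pi_m \ge d_{i,j^*}^k - d_{i,j^*-1}^k - \tau_{j^*-1} - \pi_{j^*-1} > \frac{l_i^k}{r_i}, \quad \forall m \ge j^*. \tag{B.12}$$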

From  Eq.  (B.  12)  we  obtain 


h-l 


E  (4m  +  i  -  4m  -  An  -  7r,„)  >  {h  -  / )(<^.,  -  -  7r,-._i)  (B.13) 
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n 


(B.14) 


(B.15) 


where  the  right-hand  terni  can  be  expressed  as 

h-l  h-l 

y~!  (4m+l  ~  4m  ”  '^rn  ~  '^m)  =  4/i  ~  4j*  ~  IE  '^m)- 

m=j*  m=i* 

By  combining  Eq.  (B.13)  and  Eq.  (B.14)  we  get 

4/1  >  4j* +  “■?*)“+ E  + 

m~j* 

~  4j*  +  (^  ~  j*  + +  E 

m=i* 

But  this  inequality  contradicts  Eq.  (B.IO)  and  therefore  proves  our  statement,  i.e.,  j*  =  h.  Thus, 
4ft  ~  44-  From  here  and  from  Eqs.  (5.1)  and  (5.2)  we  get 

rjk  _  pft  J.  ii  =  fjk-l  ,  ^ 


h 

n  n- 


(B.16) 


Now,  from  Eq.  (5.1)  it  follows  trivially  that 

4,j  =  inax(4j-i  +  Tj-i  -I-  TTj-i,  4j-i)  >  4i-i  +  'Tj-i  +  i  ^  1-  (®-12) 

By  iterating  over  the  above  equation  and  then  using  Eq.  (5.2)  we  get 


I’! 


h-l 


di,h  ^  4,1  +  -h  E  (An  +  TTm), 


(B.18) 


m=l 


which,  together  with  Eq.  (B.16),  lead  us  to  Eq.  (B.8). 
This  completes  the  proof  of  the  lemma.  □ 
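Lemma 4 can be sanity-checked numerically. The sketch below assumes the per-hop recurrence of Eq. (B.3), $d_{i,j}^k = \max(d_{i,j-1}^k + \tau_{j-1} + \pi_{j-1},\, d_{i,j}^{k-1}) + l_i^k/r_i$, and a first-hop rule in which a packet becomes eligible on arrival but not before the previous packet's deadline at that hop. It compares the recurrence with the closed form of Eq. (B.8) on random integer inputs. All names are hypothetical, and the first-hop rule is an assumption of this sketch.

```python
import random

def jitter_vc_deadlines(arrivals, service, hop_delay):
    """Deadlines d[k][j] of packet k at hop j from the recurrence of Eq. (B.3).

    arrivals[k]  : arrival time of packet k at the first hop
    service[k]   : l_i^k / r_i for packet k
    hop_delay[j] : tau + pi between hop j and hop j + 1 (h - 1 entries)
    """
    h = len(hop_delay) + 1
    eligible_first, deadlines = [], []
    for k, (a, s) in enumerate(zip(arrivals, service)):
        prev = deadlines[k - 1] if k > 0 else None
        # Assumed first-hop rule: eligible on arrival, never before the
        # previous packet's deadline at the first hop.
        e1 = a if prev is None else max(a, prev[0])
        d = [e1 + s]
        for j in range(1, h):
            ahead = d[j - 1] + hop_delay[j - 1]
            d.append((ahead if prev is None else max(ahead, prev[j])) + s)
        eligible_first.append(e1)
        deadlines.append(d)
    return eligible_first, deadlines

def closed_form_last_hop(e1, s, hop_delay, prev_last):
    """Right-hand side of Eq. (B.8) for a single packet."""
    h = len(hop_delay) + 1
    return max(e1 + h * s + sum(hop_delay), prev_last + s)

def check_lemma4(trials=200, seed=1):
    """Compare recurrence and closed form on random integer scenarios."""
    rng = random.Random(seed)
    for _ in range(trials):
        h = rng.randint(2, 6)
        n_pkts = rng.randint(2, 8)
        hop_delay = [rng.randint(0, 5) for _ in range(h - 1)]
        service = [rng.randint(1, 10) for _ in range(n_pkts)]
        arrivals = sorted(rng.randint(0, 40) for _ in range(n_pkts))
        e1, d = jitter_vc_deadlines(arrivals, service, hop_delay)
        for k in range(1, n_pkts):
            expected = closed_form_last_hop(e1[k], service[k], hop_delay, d[k - 1][-1])
            if d[k][-1] != expected:
                return False
    return True

print(check_lemma4())
```

Integer times are used so that the comparison is exact and not affected by floating-point rounding.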




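Lemma 5 The deadline of any packet $p_i^k$, $k > 1$, at the last hop $h$ in a network of CJVC servers is

$$d_{i,h}^k = \max\left(e_{i,1}^k + h\,\frac{l_i^k}{r_i} + \sum_{m=1}^{h-1} (\tau_m + \pi_m),\;\; d_{i,h}^{k-1} + \frac{l_i^k}{r_i}\right). \tag{B.19}$$

Proof. We consider two cases whether $\delta_i^k = 0$ or not.

Case 1. ($\delta_i^k = 0$) From Eqs. (5.2) and (5.6) it follows that

$$d_{i,h}^k = e_{i,1}^k + h\,\frac{l_i^k}{r_i} + \sum_{m=1}^{h-1} (\tau_m + \pi_m). \tag{B.20}$$

On the other hand, by the definition of $\delta_i^k$ (see Eq. (5.3) and Eq. (5.4)) we have $d_{i,j-1}^k + \tau_{j-1} + \pi_{j-1} + \delta_i^k \ge d_{i,j}^{k-1}$, $\forall j > 1$. From here and from Eq. (5.2) we obtain

$$d_{i,h}^k \ge d_{i,h}^{k-1} + \frac{l_i^k}{r_i}. \tag{B.21}$$

From this inequality and Eq. (B.20), Eq. (B.19) follows.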
Case  2.  (S^'  >  0)  By  using  Eqs.  (5.2)  and  (5.10)  we  obtain 

A,h  —  ~  1)4^  +  X] 

m=\ 


(B.22) 
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Since  >  0,  by  using  again  Eq.  (5.2)  and  (5.7)  we  get 


Ik  f'-’ 
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/(-  (iz^ 

>  e^]  +  h.~  h  ^  ^  {Tjjj  +  tlrn)- 

171  =  1 


(B.23) 


which,  together  with  Eq.  (B.22),  lead  to  Eq.  (B.19).  □ 

Theorem 2 The deadline of a packet at the last hop in a network of CJVC servers is equal to the deadline of the same packet in a corresponding network of Jitter-VC servers.
Proof. From Eqs. (5.1) and (5.2) it is easy to see that in a network of Jitter-VC servers we have

$$d_{i,h}^1 = e_{i,1}^1 + h\,\frac{l_i^1}{r_i} + \sum_{m=1}^{h-1} (\tau_m + \pi_m). \tag{B.24}$$

Similarly, in a network of CJVC servers, from Eqs. (5.1) and (5.7), and by using the fact that $\delta_i^1 = 0$ (see Eq. (5.8)), we obtain an identical expression for $d_{i,h}^1$ (i.e., Eq. (B.24)).

Finally, since (a) the eligible times of all packets $p_i^k$ at the first hop, i.e., $e_{i,1}^k$ ($\forall k \ge 1$), are identical for both Jitter-VC and CJVC servers, and since (b) the deadlines of the packets at the last hop, i.e., $d_{i,h}^k$ ($\forall k \ge 1$), are computed based on the same formulae (see Eqs. (B.8), (B.19) and (B.24)), it follows that $d_{i,h}^k$ ($\forall k \ge 1$) are identical in both a network of Jitter-VC servers and a network of CJVC servers. □


B.3  Proof  of  Theorem  3 

To  prove  Theorem  3  (see  Section  5.3.3)  we  prove  two  intermediate  results:  Lemma  9  which  gives 
the  buffer  occupancy  for  the  case  when  all  flows  have  identical  rates,  and  Lemma  12  which  gives 
the  buffer  occupancy  for  arbitrary  flow  rates. 

B.3.1  Identical  Flow  Rates 

Consider  a  work-conserving  server  with  an  output  rate  of  one,  which  is  traversed  by  n  flows  with 
identical  reservations  of  1/n.  Assume  that  the  time  axis  is  divided  in  unit  sized  slots,  where  slot  t 
corresponds  to  the  time  interval  [t,  t  +  1).  Assume  that  at  most  one  packet  can  be  sent  during  each 
slot,  i.e.,  the  packet  transmission  time  is  one  time  unit.  Finally,  assume  that  the  starting  times  of 
the  backlogged  periods  of  any  two  flows  are  uncorrelated.  In  practice,  we  enforce  this  by  delaying 
the  first  packet  of  a  backlogged  period  by  an  amount  drawn  from  a  uniform  distribution  in  the 
range $[t_{arrival}, t_{arrival} + n)$, where $t_{arrival}$ is the arrival time of the first packet in the backlogged
period.  Note  that  according  to  Eq.  (5.1),  the  eligible  times  of  the  packets  of  a  flow  during  a  flow’s 
backlogged  interval  are  periodic  with  period  n.  Thus,  without  loss  of  generality,  we  assume  that  the 
arrival  process  of  any  flow  during  a  backlogged  interval  is  periodic. 

Let $r(t', t'')$ denote the number of packets received (i.e., that became eligible) during the interval $[t', t'')$, and let $s(t', t'')$ denote the number of packets sent during the same interval. Note that $r(t', t'')$ and $s(t', t'')$ do not include packets received/transmitted during slot $t''$. Let $q(t)$ denote the size of the queue at the beginning of slot $t$. Then, if no packets are dropped, we have

$$q(t'') = q(t') + r(t', t'') - s(t', t''). \tag{B.25}$$

Since at most one packet is sent during each time slot, we have $s(t', t'') \le t'' - t'$. Equality holds when $[t', t'')$ belongs to a server busy period. A busy period is defined as an interval during which the server's queue is never empty. Also, note that if $t_0$ is the starting time of a busy period, then $q(t_0) = 0$.

The  next  result  shows  that  to  compute  an  upper  bound  for  q(t),  it  is  enough  to  consider  only  the 
scenarios  in  which  all  flows  are  continuously  backlogged. 




Lemma 6 Let $t_1$ be an arbitrary time slot during a server busy period that starts at time $t_0$. Assume flow $i$ is not continuously backlogged during the interval $[t_0, t_1)$. Then $q(t_1)$ can only increase if flow $i$ becomes continuously backlogged during $[t_0, t_1)$.

Proof. We consider two cases: either flow $i$ is idle during the entire interval $[t_0, t_1)$, or it is not.

If flow $i$ is idle during $[t_0, t_1)$, consider the modified scenario in which flow $i$ becomes backlogged at an arbitrary time $t < t_0$, and remains continuously backlogged during $[t_0, t_1)$. In addition, assume that the arrival patterns of all the other flows remain unchanged. As a result, it is easy to see that in the modified scenario, the total number of packets received during $[t_0, t_1)$ can only increase, while the starting time of the busy interval can only decrease. Let $r'$, $s'$, and $q'$ denote the corresponding values in the modified scenario. Then $q'(t_0) \ge q(t_0) = 0$, $r'(t_0, t_1) \ge r(t_0, t_1)$, and $s'(t_0, t_1) = s(t_0, t_1) = t_1 - t_0$. From Eq. (B.25) it follows then that $q'(t_1) \ge q(t_1)$.

In the second case, when flow $i$ is neither idle nor continuously backlogged during the interval $[t_0, t_1)$, let $t'$ denote the time when the last packet of flow $i$ arrives during $[t_0, t_1)$. Next consider the modified scenario in which flow $i$'s packets arrive at times $t' - na, \ldots, t' - n, t', t' + n, \ldots, t' + nb$, such that $t' - na < t_0$ and $t_1 \le t' + nb$. It is easy to see then that the number of packets of flow $i$ that arrive during $[t_0, t_1)$ is no smaller than the number of packets of flow $i$ that arrive during the same interval in the original scenario. By assuming that the arrival patterns of all the other flows do not change, it follows that $r'(t_0, t_1) \ge r(t_0, t_1)$. In addition, since at most $t_1 - t_0$ packets are transmitted during $[t_0, t_1)$ we have $s'(t_0, t_1) \le t_1 - t_0$. The inequality is strict if, after changing the arrival pattern of flow $i$, the server is no longer busy during the entire interval $[t_0, t_1)$. In addition, we have $q'(t_0) \ge 0$, and from the hypothesis $q(t_0) = 0$. Finally, from Eq. (B.25) we obtain $q'(t_1) \ge q(t_1)$, which concludes the proof of the lemma. □

As  a  consequence,  in  the  remainder  of  this  section,  we  limit  our  study  to  a  busy  period  in  which 
all  flows  are  continuously  backlogged. 

Let $t_1$ be the time when the last flow becomes backlogged. Let $t_0$ be the latest time no larger than $t_1$ when the server becomes busy, i.e., it has no packet to send during $[t_0 - 1, t_0)$ and is continuously busy during the interval $[t_0, t_1 + 1)$. Then we have the following result.

Lemma 7 If all flows remain continuously backlogged after time $t_1$, the server is busy for any time $t \ge t_0$.

Proof. By the definition of $t_0$, the server is busy during $[t_0, t_1)$. Next we show that the server is also busy for any $t \ge t_1$.

Consider a flow that becomes backlogged at time $t'$. Since its arrival process is periodic it follows that during any interval $[t' - n + i, t' + i)$, $\forall i > 0$, exactly one packet of this flow arrives. Since after time $t_1$ all $n$ flows are backlogged, exactly $n$ packets are received during $[t_1 - n + i, t_1 + i)$, $\forall i > 0$. Since at most $n$ packets are sent during each of these intervals, it follows that the server cannot be idle during any slot $t \ge t_1$. □
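The behaviour described by Lemma 7 is easy to observe in a slotted simulation. The following sketch is illustrative only: it assumes $n$ flows of period $n$ whose first eligible packets fall at random slots in $[0, n)$, and a server that sends one packet per slot. It records the queue length seen in each slot after that slot's arrivals and before its transmission. The function name and parameter values are hypothetical.

```python
import random

def queue_trace(n, phases, horizon):
    """Queue length in each slot (after arrivals, before the slot's single
    transmission) for n periodic flows of period n and a unit-rate server."""
    backlog, trace = 0, []
    for t in range(horizon):
        # A flow with phase p has a packet become eligible at p, p + n, p + 2n, ...
        backlog += sum(1 for p in phases if t >= p and (t - p) % n == 0)
        trace.append(backlog)
        if backlog > 0:
            backlog -= 1  # at most one packet leaves per slot
    return trace

random.seed(7)
n = 20
phases = [random.randrange(n) for _ in range(n)]
t1 = max(phases)  # slot in which the last flow becomes backlogged
trace = queue_trace(n, phases, 5 * n)
print(min(trace[t1:]), max(trace))
```

From slot $t_1$ on, the recorded queue is never empty and repeats with period $n$, since each window of $n$ slots brings exactly $n$ arrivals and $n$ departures.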

Consider a buffer of size $s$. Our goal is to compute the probability that the buffer will overflow during an arbitrary interval $[t_0, t_0 + d)$. From Lemma 7 it follows that since the server is busy during $[t_0, t_0 + d)$, exactly $d$ packets are transmitted during this interval. In addition, since the starting times of the backlogged periods of different flows are not correlated, in the remainder of this section we also assume that the starting time of a flow's backlogged period is not correlated with the starting time, $t_0$, of a busy period. Thus, during the interval $[t_0, t_0 + d)$, a flow receives $\lceil d/n \rceil$ packets with probability $p(d) = d/n - \lfloor d/n \rfloor$, and $\lfloor d/n \rfloor$ packets with probability $1 - p(d)$. Since this probability is periodic with period $n$ it will suffice to consider only intervals of size equal to at most $n$. Consequently, we will assume $d \le n$. The probability to receive one packet during $[t_0, t_0 + d)$ is then

$$p(d) = \frac{d}{n}. \tag{B.26}$$

Let $p(m; d)$ denote the probability with which exactly $m$ packets are received during the time interval $[t_0, t_0 + d)$, where

$$p(m; d) = \binom{n}{m}\, p(d)^m\, (1 - p(d))^{n-m}. \tag{B.27}$$

Now, let $P(x > s, u)$ denote the probability with which the queue size exceeds $s$ at time $t_0 + u$. Since the server is idle at $t_0$ and busy during $[t_0, t_0 + u)$, from Eq. (B.25) it follows that the server's queue overflows when more than $u + s$ packets are received during $[t_0, t_0 + u)$. Thus, we have

$$P(x > s, u) = \sum_{i=u+s+1}^{n} p(i; u). \tag{B.28}$$
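For small $n$ the overflow probability can be evaluated exactly instead of bounded. The sketch below assumes Eq. (B.27) is the binomial distribution with parameter $p(d) = d/n$ and that Eq. (B.28) sums it over all counts larger than $u + s$. The function names are hypothetical.

```python
from math import comb

def p_exact(m, d, n):
    """Eq. (B.27): probability that exactly m of the n flows deliver a packet
    in an interval of d <= n slots, each independently with probability d/n."""
    p = d / n
    return comb(n, m) * p**m * (1 - p) ** (n - m)

def overflow_probability(s, u, n):
    """Eq. (B.28): probability that more than u + s packets arrive in u slots."""
    return sum(p_exact(i, u, n) for i in range(u + s + 1, n + 1))

n = 100
for s in (5, 10, 20):
    # Worst case over the position u inside one period of n slots.
    worst = max(overflow_probability(s, u, n) for u in range(1, n + 1))
    print(s, worst)
```

Scanning $u$ over one period is enough here because the arrival pattern repeats with period $n$.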

The next result gives an upper bound for $P(x > s, u)$.


Lemma  8  The  probability  that  a  queue  of  size  s  overflows  at  time  to  +  u  is  bounded  by 

T  (n  +  s)^ 


P{x  >  s,u)  <  /0(n) 

where  ^{n)  = 

Proof.  From  Eq.  (B.27)  we  obtain 


27r  \1  +  (s  —  l)/2n 


4sn 


,  ,  ,  piu)  n  —  m 

pirn  +  1;  u)  =  - - 7-7  - - - 

’  ’  l-p(u)  m+1 


p(m;u), 


By  plugging  the  above  equation  and  Eq.  (B.26)  into  Eq.  (B.28)  we  obtain 


n  j 

P{x>s^u)  =  p{u-\~s;u)  T.  n 


<  p{u  +  s-,u)  E  n 


■ft 

=  p{u  +  s-,u)  Y, 


i—1 

n 

n  —  k\  /' 

k  +  l  j  1 

k=U-\-S 

i—l 

n 

n  —  u  —  s 

U  +  5 

k~u-\-s 

n  —  u 

—  S  U 

u  y-"- 
-  uj 

(—) 

\n  —  u  J 


u  +  s  n 


—) 

-  uJ 


i—u—s 


(B.29) 


(B.30) 


(B.31) 


Next, it can be easily verified that for any positive reals $a$, $b$, and $x$, such that $b - x > 0$, we have

$$\frac{a}{a + x} \cdot \frac{b - x}{b} \le \left(\frac{a + b - x}{a + b + x}\right)^2. \tag{B.32}$$
By taking $a = u$, $b = n - u$, $x = s$, Eq. (B.31) becomes

$$P(x > s, u) \le p(u + s; u) \sum_{i=u+s+1}^{n} \left(\frac{n - s}{n + s}\right)^{2(i - u - s)} < p(u + s; u) \sum_{i=0}^{\infty} \left(\frac{n - s}{n + s}\right)^{2i} = p(u + s; u)\, \frac{(n + s)^2}{4sn}. \tag{B.33}$$

Next,  it  remains  to  bound  p{u  +  .s;  u).  From  Eqs.  (B.26)  and  (B,30)  we  have 


p{u  +  .s;  u)  —  p{7i:  ”)n 


s~]  /  -x  .s  — 1  , 

U  71  ~  U  —  l\  T-r  /  71  71  —  71  ~  7. 


<n 


.  x?;,  —  u  u  +  i  +  l  /  \u  +  i  ii  —  ii. 

7=[)  '  ?  — 0  ^ 

By  using  Eq.  (B.32)  with  a  —  u,  h  —  n  —  u,  and  x  =  i,  we  obtain 

p{u  +  .s;  h)  =  p(u\  ^l)  =  p{u:  u)  JJ  ( 

'  71  7/  \7l  +  7  ?l  +  ,S  —  1  —  7 


(B.34) 


i^O 


i=0 


(B.35) 


Again,  by  applying  Eq.  (B.32)  to  the  pairs  (?;  ~  i)/{7i  +  i)  and  {71  —  .s  +  1  +  i)/{7i  +  .s*  ~  1  —  ?;), 
Vi  <  672,  we  have 


p{7i  +  ,s;  71)  <  p{7i,:  7i) 


f27l  -  (.S  -  1) 
2n  +  (.s  -  1) 


2.S 


=  p{7i:  7l) 


1  -  {S  -  1)/277 
1  +  (.s  -  1)/2?7 


2.S- 


(B.36) 


To  bound  p(7i:  ?/)  we  use  Stirling  inequalities  [26],  i.e.,  \/2™(77/c)”  <77!  < 
V?z  >  1.  From  here  we  have 


71  ^ 

,  n  —  71  j 


< 


\/27rn(7?7e)"'^(^/^*^^^^ 


y/2'K7i{7i/ey'  y27r(7i  —  7/.)((77  —  7i)/ey^~'^^ 


(B.37) 


77 


27r(77  —  77)7/  71^^  {71  —  7l)'^~'^^ 


By  combining  Eqs.  (B.26),  (B.27)  and  (B.37),  we  obtain 

p(7/;7i)  <  ^(77 


27r77(77-7l)  ”  ^^^^'V27r‘ 


(B.38) 


where  P{n)  =  and  the  last  inequality  follows  from  the  fact  that  7i/{{7i  —  71)71)  <  1, 

for  any  7/  >  1,  tt-  >  2.  By  plugging  the  above  result  in  Eq.  (B.33)  we  obtain 


r./  X  ^  I  ^  /1-(.S-  1)/277A^"  (77,  +  .s)2 

P(,7;  >  5,7/)  <  /3(77)7 


Z.7i  Vl  +  (5-l)/27t, 


4sn 


(B.39) 


□ 
Lemma 9 Consider $n$ flows with identical rates and unit packet sizes. Then given a buffer of size $s$, where

$$s \ge \sqrt{n \left(\frac{\ln n - \ln \varepsilon}{2} - 1\right)}, \tag{B.40}$$

the probability that the buffer overflows during an arbitrary time slot when the server is busy is asymptotically $< \varepsilon$.

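Proof. To compute the asymptotic bound for $P(x > s, u)$ assume that $s \ll n$. Since $(1 - x)/(1 + x) \approx 1 - 2x$ and $\ln(1 - x) \approx -x$, for $x \to 0$, and since $(n + s)^2/(sn) < n$ for $n > s \ge 4$, and $\beta(n) < 1.102$ for any $n \ge 1$, by using Eq. (B.39) we obtain¹

$$\begin{aligned}
\ln P(x > s, u) &\le \ln\left(\beta(n) \sqrt{\frac{1}{2\pi}}\right) + 2s \cdot \ln\frac{1 - (s - 1)/2n}{1 + (s - 1)/2n} + \ln n - \ln 4 \\
&\approx \ln\left(\beta(n) \sqrt{\frac{1}{2\pi}}\right) + 2s \cdot \ln\left(1 - \frac{s - 1}{n}\right) + \ln n - \ln 4 \\
&\approx \ln\left(\beta(n) \sqrt{\frac{1}{2\pi}}\right) - 2\,\frac{s(s - 1)}{n} + \ln n - \ln 4 \\
&< -2 - 2\,\frac{s(s - 1)}{n} + \ln n \approx -2\left(\frac{s^2}{n} + 1\right) + \ln n.
\end{aligned} \tag{B.41}$$

Using $\varepsilon$ to bound $P(x > s, u)$ leads us to

$$P(x > s, u) \le \varepsilon \;\Leftarrow\; -2\left(\frac{s^2}{n} + 1\right) + \ln n \le \ln \varepsilon \;\Rightarrow\; s \ge \sqrt{n \left(\frac{\ln n - \ln \varepsilon}{2} - 1\right)}. \tag{B.42}$$

□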

Next we prove a stronger result by computing an asymptotic upper bound for the probability with which a queue of size $s$ overflows during an arbitrary busy interval. Let $Q(x > s)$ denote this probability. The key observation is that since all flows have period $n$, the aggregate arrival traffic will have the same period $n$. In addition, since during each of these periods exactly $n$ packets are received/transmitted it follows that the queue size at any time $t_0 + i \cdot n + j$ is the same, $\forall i, j > 0$. Consequently, if the queue does not overflow during $[t_0, t_0 + n)$, the queue will not overflow at any other time $t > t_1$ during the same busy period. Thus, the problem reduces to computing the probability of the queue overflowing during the interval $[t_0, t_0 + n)$. Then we have the following result.


Corollary 1 Consider $n$ flows with identical rates and unit packet sizes. Then given a buffer of size $s$, where

$$s \ge \sqrt{n \left(\ln n - (\ln \varepsilon)/2 - 1\right)}, \tag{B.43}$$

the probability that the buffer overflows during an arbitrary busy interval is asymptotically $< \varepsilon$.

¹ More precisely, $\ln(\beta(n)/\sqrt{2\pi}) - \ln 4 < -2.2081062\ldots$
Proof. Let $\varepsilon'$ be the probability that a buffer of size $s$ overflows at an instant $t$ during the busy interval $[t_0, t_0 + n)$. Then the probability that the buffer overflows during this interval is smaller than $1 - (1 - \varepsilon')^n \le n \cdot \varepsilon'$. Now, recall that if the buffer does not overflow during $[t_0, t_0 + n)$, the buffer will not overflow after time $t_0 + n$. Thus the probability that the buffer will overflow during an arbitrary busy period is less than $n \varepsilon'$. Finally, let $\varepsilon = n \cdot \varepsilon'$, and apply the result of Lemma 9 for $\varepsilon'$, i.e.,

$$s \ge \sqrt{n \left(\frac{\ln n - \ln \varepsilon'}{2} - 1\right)} = \sqrt{n \left(\ln n - \frac{\ln \varepsilon}{2} - 1\right)}. \tag{B.44}$$

□


B.3.2  Arbitrary  Flow  Rates 

In this section we determine the buffer bound for a system in which packets are of unit size, but the
reservations can be arbitrary. The basic idea is to use a succession of transformations to reduce the
problem to the case in which the probabilities associated with the flows can take, at most, three distinct
values, and then to apply the results from the previous case, in which all reservations are assumed to be
identical.

Consider $n$ flows, and let $r_k$ denote the rate reserved by flow $k$, where

\[
\sum_{k=1}^{n} r_k = 1. \tag{B.45}
\]

Consider again the case when all flows are continuously backlogged. Let $t_0$ denote the starting
time of a busy period. Since the time when flow $k$ becomes backlogged is assumed to be independent
of $t_0$, it follows that during the interval $[t_0, t_0 + d)$ flow $k$ receives exactly $\lfloor d \cdot r_k \rfloor + 1$ packets with
probability

\[
p_k(d) = d \cdot r_k - \lfloor d \cdot r_k \rfloor, \tag{B.46}
\]

and $\lfloor d \cdot r_k \rfloor$ packets with probability $1 - p_k(d)$.

Let $p(m; d)$ denote the probability with which the server receives exactly $\sum_{k=1}^{n} \lfloor d \cdot r_k \rfloor + m$
packets during the interval $[t_0, t_0 + d)$. Then

\[
p(m; d) = T_n^m(p_1(d), p_2(d), \ldots, p_n(d)), \tag{B.47}
\]

where $T_n^m(p_1(d), p_2(d), \ldots, p_n(d))$ is the coefficient of $x^m$ in the expansion of

\[
\prod_{i=1}^{n} \bigl(p_i(d)\, x + (1 - p_i(d))\bigr). \tag{B.48}
\]

Note that when all flows have equal reservations, i.e., $r_k = 1/n$, $1 \le k \le n$, Eq. (B.47) reduces to
Eq. (B.27).
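
As an illustration of Eqs. (B.46)-(B.48), the following Python sketch computes the per-flow probabilities $p_k(d)$ and expands the product in Eq. (B.48) to obtain $p(m; d)$ for every $m$. The function names and the example rates are assumptions made for this illustration only. Exact rational arithmetic is used so that the floor in Eq. (B.46) is not affected by rounding.

```python
from fractions import Fraction
import math


def extra_packet_probs(rates, d):
    """p_k(d) = d*r_k - floor(d*r_k), as in Eq. (B.46)."""
    return [d * r - math.floor(d * r) for r in rates]


def arrival_distribution(probs):
    """Coefficients of prod_i (p_i*x + (1 - p_i)), lowest degree first.

    Entry m is the coefficient of x^m, i.e. p(m; d) in Eq. (B.47).
    """
    coeffs = [Fraction(1)]
    for p in probs:
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for m, c in enumerate(coeffs):
            nxt[m] += c * (1 - p)      # this flow sends floor(d*r) packets
            nxt[m + 1] += c * p        # this flow sends one extra packet
        coeffs = nxt
    return coeffs


# Hypothetical example: three flows whose rates sum to 1, interval length d = 3.
rates = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
probs = extra_packet_probs(rates, d=3)
dist = arrival_distribution(probs)
for m, value in enumerate(dist):
    print(m, value)
```

The list `dist` always sums to one, since the product in Eq. (B.48) evaluated at $x = 1$ equals 1.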

By using Eq. (B.46) the number of packets received during $[t_0, t_0 + d)$ can be written as

\[
\sum_{k=1}^{n} \lfloor d \cdot r_k \rfloor + m = \sum_{k=1}^{n} \bigl(d \cdot r_k - p_k(d)\bigr) + m = d - \sum_{k=1}^{n} p_k(d) + m. \tag{B.49}
\]

Since $t_0$ is the starting time of the busy period and since the server remains busy during $[t_0, t_0 + d)$,
from Eq. (B.25) it follows that $q(t_0 + d) = m - \sum_{k=1}^{n} p_k(d)$.

Similarly, the probability $P(x > s; u)$ that a queue of size $s$ will overflow at time $t_0 + u$ is

\[
P(x > s; u) = \sum_{i=v+1}^{n} p(i; u), \tag{B.50}
\]

where $v = \sum_{k=1}^{n} p_k(u) + s$.

Since in the following $p_k(u)$ is always defined over $[t_0, t_0 + u)$, we will drop the argument from
the $p_k(u)$ notation. Next, note that for any two flows $k$ and $l$, $p(m; u)$ can be rewritten as

\[
p(m; u) = p_k p_l A_{k,l}(m) + \bigl(p_k(1 - p_l) + (1 - p_k) p_l\bigr) B_{k,l}(m) + (1 - p_k)(1 - p_l) C_{k,l}(m), \tag{B.51}
\]

where $p_k p_l A_{k,l}$ represents all terms in $T_n^m(p_1, p_2, \ldots, p_n)$ that contain $p_k p_l$,
$(p_k(1 - p_l) + (1 - p_k) p_l) B_{k,l}$ represents all terms that contain either $p_k(1 - p_l)$ or $(1 - p_k) p_l$,
and $(1 - p_k)(1 - p_l) C_{k,l}$ represents all terms that contain $(1 - p_k)(1 - p_l)$.

From Eqs. (B.50) and (B.51), the probability that a queue of size $s$ will overflow at time $t_0 + u$ is
then

\[
\begin{aligned}
P(x > s; u) &= \sum_{i=v+1}^{n} p(i; u) \\
            &= p_k p_l \cdot A_{k,l}(v, n) + \bigl(p_k(1 - p_l) + (1 - p_k) p_l\bigr) \cdot B_{k,l}(v, n) + (1 - p_k)(1 - p_l) \cdot C_{k,l}(v, n),
\end{aligned} \tag{B.52}
\]

where $A_{k,l}(v, n) = \sum_{i=v+1}^{n} A_{k,l}(i)$, $B_{k,l}(v, n) = \sum_{i=v+1}^{n} B_{k,l}(i)$, and
$C_{k,l}(v, n) = \sum_{i=v+1}^{n} C_{k,l}(i)$, respectively.

Our next goal is to reduce the problem of bounding $P(x > s; u)$ to the case in which the flows'
probabilities take a limited number of values. This makes it possible to use the results from the
homogeneous reservations case without compromising the bound quality too much. The idea is
to iteratively modify the values of the flows' probabilities without decreasing $P(x > s; u)$. In
particular, we consider the following simple transformation: select two probabilities $p_k$ and $p_l$ and
update them as follows:

\[
p'_k = p_k - \delta, \qquad p'_l = p_l + \delta, \tag{B.53}
\]

where $\delta$ is a real value such that $0 \le p'_k, p'_l \le 1$, and the newly computed probability

\[
P'(x > s; u) = p'_k p'_l \cdot A_{k,l}(v, n) + \bigl(p'_k(1 - p'_l) + (1 - p'_k) p'_l\bigr) \cdot B_{k,l}(v, n) + (1 - p'_k)(1 - p'_l) \cdot C_{k,l}(v, n) \tag{B.54}
\]

is greater than or equal to $P(x > s; u)$.

It is interesting to note that performing transformation (B.53) is equivalent to defining a new system
in which the reservations of flows $k$ and $l$ are changed to $r'_k$ and $r'_l$, respectively, such that
$p'_k = d \cdot r'_k - \lfloor d \cdot r'_k \rfloor$ and $p'_l = d \cdot r'_l - \lfloor d \cdot r'_l \rfloor$. There are two observations worth noting about this
system. First, by choosing $r'_k = r_k - \delta/d$ and $r'_l = r_l + \delta/d$ we maintain the invariant $\sum_{i=1}^{n} r_i = 1$.
Second, while in the new system the start time $t_0$ of the busy period may change, this will not
influence $P'(x > s; u)$, as this depends only on the length of the interval $[t_0, t_0 + u)$.

Next, we give the details of our transformation. From Eqs. (B.52), (B.53) and (B.54), after some
simple algebra, we obtain

\[
P'(x > s; u) - P(x > s; u) = \delta (p_k - p_l - \delta) D_{k,l}(v, n), \tag{B.55}
\]

where

\[
D_{k,l}(v, n) = A_{k,l}(v, n) - 2 B_{k,l}(v, n) + C_{k,l}(v, n). \tag{B.56}
\]

Recall that our goal is to choose $\delta$ such that $P'(x > s; u) \ge P(x > s; u)$. Without loss of
generality assume that $p_k > p_l$. We consider two cases: (1) if $D_{k,l}(v, n) > 0$, then $\delta > 0$ and
$p_k > p_l + \delta$ ($\delta < 0$ and $p_k < p_l + \delta$ cannot be simultaneously true); (2) if $D_{k,l}(v, n) < 0$, then
either $\delta > 0$ and $p_k < p_l + \delta$, or $\delta < 0$ and $p_k > p_l + \delta$.
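
The identity in Eq. (B.55) can be checked numerically. The Python sketch below is an illustration under stated assumptions, not part of the original proof: it takes $v$ to be an integer index, takes $A_{k,l}(i)$, $B_{k,l}(i)$ and $C_{k,l}(i)$ to be the coefficients of $x^{i-2}$, $x^{i-1}$ and $x^{i}$ in the product of Eq. (B.48) restricted to the other $n - 2$ flows, and uses $D_{k,l} = A_{k,l} - 2B_{k,l} + C_{k,l}$, which is what expanding (B.54) minus (B.52) gives. All function names and the example values are hypothetical.

```python
from fractions import Fraction


def expand(probs):
    """Coefficients of prod_i (p_i*x + (1 - p_i)), lowest degree first."""
    coeffs = [Fraction(1)]
    for p in probs:
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for m, c in enumerate(coeffs):
            nxt[m] += c * (1 - p)
            nxt[m + 1] += c * p
        coeffs = nxt
    return coeffs


def overflow_prob(probs, v):
    """Sum of the coefficients with index v+1..n, the form of Eq. (B.50)."""
    return sum(expand(probs)[v + 1:])


def d_term(probs, k, l, v):
    """A - 2B + C for flows k and l, built from the other n - 2 flows."""
    rest = expand([p for i, p in enumerate(probs) if i not in (k, l)])

    def coef(j):
        return rest[j] if 0 <= j < len(rest) else Fraction(0)

    idx = range(v + 1, len(probs) + 1)
    a = sum(coef(i - 2) for i in idx)   # terms multiplied by p_k * p_l
    b = sum(coef(i - 1) for i in idx)   # exactly one of k, l sends an extra packet
    c = sum(coef(i) for i in idx)       # neither k nor l sends an extra packet
    return a - 2 * b + c


# Hypothetical example with four flows.
probs = [Fraction(1, 5), Fraction(7, 10), Fraction(2, 5), Fraction(1, 2)]
k, l, v, delta = 1, 0, 2, Fraction(1, 10)

shifted = list(probs)
shifted[k] -= delta   # p'_k = p_k - delta
shifted[l] += delta   # p'_l = p_l + delta

lhs = overflow_prob(shifted, v) - overflow_prob(probs, v)
rhs = delta * (probs[k] - probs[l] - delta) * d_term(probs, k, l, v)
print(lhs, rhs)
```

With exact rational arithmetic the two sides agree exactly, which is a quick sanity check on the sign analysis of cases (1) and (2).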

Let $p_{min} = \min_{1 \le k \le n} p_k$ and $p_{max} = \max_{1 \le k \le n} p_k$, respectively. Consider the following three
subsets, denoted $U$, $V$, and $M$, where $U$ contains all flows $k$ such that $p_k = p_{min}$, $V$ contains all
flows $k$ such that $p_k = p_{max}$, and $M$ contains all the other flows. The idea is then to successively
apply the transformation (B.53) on $p_1, p_2, \ldots, p_n$ until the probabilities of all flows in $M$ become
equal. In this way we reduce the problem to the case in which the probabilities $p_k$ can take at most
three distinct values: $p_{min}$, $p_{max}$, and $p_M$, where $p_k = p_M$, $\forall k \in M$. Figure B.2 shows the iterative
algorithm that achieves this. Lemmas 10 and 11 prove that by using the algorithm in Figure B.2,
$p_1, p_2, \ldots, p_n$ converge asymptotically to the three values.

while (|M| > 1) do                      /* while size of M is greater than one */
    p_l = min_{i in M}(p_i);
    p_k = max_{i in M}(p_i);
    if (D_{k,l}(v, n) > 0)
        p_k = p_l = (p_k + p_l)/2;
    else
        delta = max(p_k - p_max, p_min - p_l);
        p_k = p_k - delta; p_l = p_l + delta;
        if (p_l = p_min)
            M = M \ {l}; U = U ∪ {l};
        else                            /* p_k = p_max */
            M = M \ {k}; V = V ∪ {k};

Figure B.2: Reducing $p_1, p_2, \ldots, p_n$ to three distinct values.
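
A runnable approximation of the procedure in Figure B.2 is sketched below in Python. It is an illustration rather than the author's implementation, and it makes several assumptions: $v$ is an integer index, $D_{k,l}(v, n)$ is computed as $A_{k,l} - 2B_{k,l} + C_{k,l}$ from the polynomial expansion of the other $n - 2$ flows, and, because the averaging branch converges only asymptotically, the loop also stops once the spread of the probabilities in $M$ falls below a hypothetical tolerance `tol` or after `max_iter` iterations.

```python
def expand(probs):
    """Coefficients of prod_i (p_i*x + (1 - p_i)), lowest degree first."""
    coeffs = [1.0]
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for m, c in enumerate(coeffs):
            nxt[m] += c * (1.0 - p)
            nxt[m + 1] += c * p
        coeffs = nxt
    return coeffs


def overflow_prob(probs, v):
    """Sum of the coefficients with index v+1..n, the form of Eq. (B.50)."""
    return sum(expand(probs)[v + 1:])


def d_term(probs, k, l, v):
    """A - 2B + C for flows k and l, built from the other n - 2 flows."""
    rest = expand([p for i, p in enumerate(probs) if i not in (k, l)])

    def coef(j):
        return rest[j] if 0 <= j < len(rest) else 0.0

    idx = range(v + 1, len(probs) + 1)
    a = sum(coef(i - 2) for i in idx)
    b = sum(coef(i - 1) for i in idx)
    c = sum(coef(i) for i in idx)
    return a - 2.0 * b + c


def reduce_to_three_values(probs, v, tol=1e-9, max_iter=10_000):
    """Apply transformation (B.53) repeatedly, following Figure B.2."""
    p = list(probs)
    p_min, p_max = min(p), max(p)
    middle = [i for i, x in enumerate(p) if p_min < x < p_max]  # the set M
    for _ in range(max_iter):
        if len(middle) <= 1:
            break
        l = min(middle, key=lambda i: p[i])
        k = max(middle, key=lambda i: p[i])
        if p[k] - p[l] <= tol:          # M is numerically uniform already
            break
        if d_term(p, k, l, v) > 0:
            p[k] = p[l] = (p[k] + p[l]) / 2.0
        else:
            to_max = p[k] - p_max       # both candidates for delta are <= 0
            to_min = p_min - p[l]
            delta = max(to_max, to_min)
            if to_min >= to_max:        # p_l reaches p_min: move l to U
                p[k] -= delta
                p[l] = p_min
                middle.remove(l)
            else:                       # p_k reaches p_max: move k to V
                p[k] = p_max
                p[l] += delta
                middle.remove(k)
    return p, middle


# Hypothetical example with six flows.
probs = [0.1, 0.9, 0.3, 0.5, 0.7, 0.4]
v = 3
final, middle = reduce_to_three_values(probs, v)
print(final)
print(overflow_prob(probs, v), overflow_prob(final, v))
```

Each step leaves the sum of the probabilities unchanged and, by Eq. (B.55), should not decrease the overflow probability, so the two printed probabilities give a quick check of the monotonicity that the argument relies on.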


Lemma 10 After an iteration of the algorithm in Figure B.2, either the size of $M$ decreases by one,
or the standard deviation of the probabilities in $M$ decreases by a factor of at least $(1 - 1/(2|M|))$.


Proof. The first part is trivial; if $D_{k,l}(v, n) < 0$ the size of $M$ decreases by one. For the second
part, let $\bar{p}$ denote the average value of the probabilities associated with the flows in $M$, i.e.,

\[
\bar{p} = \frac{\sum_{i \in M} p_i}{|M|}. \tag{B.57}
\]

The standard deviation associated with the probabilities in $M$ is

\[
\mathit{dev} = \sum_{i \in M} (p_i - \bar{p})^2. \tag{B.58}
\]

After averaging probabilities $p_k$ and $p_l$, the standard deviation changes to

\[
\mathit{dev}' = \mathit{dev} + 2\left(\frac{p_k + p_l}{2} - \bar{p}\right)^2 - (p_k - \bar{p})^2 - (p_l - \bar{p})^2 = \mathit{dev} - \frac{(p_k - p_l)^2}{2}. \tag{B.59}
\]

Since $p_l$ and $p_k$ are the lowest and, respectively, the highest probabilities in $M$, we have
$(p_i - \bar{p})^2 \le (p_l - p_k)^2$, $\forall i \in M$. From here and from Eqs. (B.58) and (B.59) we have

\[
\mathit{dev} = \sum_{i \in M} (p_i - \bar{p})^2 \le |M| (p_l - p_k)^2 = 2|M| (\mathit{dev} - \mathit{dev}') \;\Rightarrow\; \mathit{dev}' \le \mathit{dev} \cdot \left(1 - \frac{1}{2|M|}\right). \tag{B.60}
\]

□

Lemma 11 Consider $n$ flows, and let $p_i$ denote the probability associated with flow $i$. Then, by
using the algorithm in Figure B.2, the probabilities $p_i$ ($1 \le i \le n$) converge to, at most, three
values.


Proof. Let $\varepsilon$ be an arbitrarily small real. The idea is then to show that after a finite number of
iterations of the algorithm in Figure B.2, the standard deviation of the $p_i$'s ($i \in M$) becomes smaller
than $\varepsilon$.

The standard deviation for the probabilities of flows in $M$ is trivially bounded as follows:

\[
\mathit{dev} = \sum_{i \in M} (p_i - \bar{p})^2 \le \sum_{i \in M} (p_{max} - p_{min})^2 = |M| (p_{max} - p_{min})^2 \le n (p_{max} - p_{min})^2. \tag{B.61}
\]

Assume $D_{k,l}(v, n) > 0$ (i.e., $M$ does not change) for $n_1$ consecutive iterations. Then, by using
Lemma 10, it is easy to see that $n_1$ is bounded above by $N$, where

\[
\mathit{dev} \cdot \left(1 - \frac{1}{2|M|}\right)^{N} < \mathit{dev} \cdot \left(1 - \frac{1}{2n}\right)^{N}, \qquad
N = \frac{\ln(\varepsilon / \mathit{dev})}{\ln(1 - 1/(2n))}. \tag{B.62}
\]

Since the above bound, $N$, holds for any set $M$, it follows that after $nN$ iterations, we are guaranteed
that either set $M$ becomes empty, a case in which the lemma is trivially true, or $\mathit{dev} < \varepsilon$. □
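
To give a feel for the size of $N$ in Eq. (B.62), the short Python sketch below evaluates it for hypothetical values of $\mathit{dev}$, $\varepsilon$ and $n$. The function name is an assumption, and rounding up to an integer is a choice made here so that the result can be read as an iteration count.

```python
import math


def averaging_iterations_bound(dev, eps, n):
    """Smallest integer N with dev * (1 - 1/(2n))**N <= eps, following Eq. (B.62)."""
    if dev <= eps:
        return 0
    return math.ceil(math.log(eps / dev) / math.log(1.0 - 1.0 / (2.0 * n)))


# Hypothetical values: initial deviation 0.5, target 1e-6, ten flows.
dev, eps, n = 0.5, 1e-6, 10
N = averaging_iterations_bound(dev, eps, n)
print(N, dev * (1.0 - 1.0 / (2.0 * n)) ** N)
```

Because the per-iteration factor $1 - 1/(2n)$ approaches 1 as $n$ grows, $N$ grows roughly linearly in $n$ for fixed $\varepsilon / \mathit{dev}$.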

Thus, we have reduced the problem to computing an upper bound for the probability $P(x > s; u)$ in
a system in which the probabilities take only three values at time $u$: $p_{min}$, $p_{max}$, and $p_M$.
Next we give the main result of this section.


Lemma 12 Consider $n$ flows with unit packet sizes and arbitrary flow reservations. Then given a
buffer of size $s$, where

\[
s > \sqrt{3n\left(\frac{\ln n}{2} - \frac{\ln \varepsilon}{2} - 1\right)}, \tag{B.63}
\]

the probability that the buffer overflows in an arbitrary time slot during a server busy period is
asymptotically $< \varepsilon$.

Proof. Consider the probability, $P(x > s; u)$, with which the queue overflows at time $t_0 + u$ (see
Eq. (B.50)). Next, by using the algorithm in Figure B.2, we reduce the probabilities $p_i$ ($1 \le i \le n$) to
three values: $p_{min}$, $p_{max}$, and $p_M$, respectively. Let $p_i^f$ denote the final probability of flow $i$, and let
$P^f(x > s; u)$ denote the final probability of the queue overflowing at time $t_0 + u$. More precisely,
from Eqs. (B.50) and (B.47) we have

\[
P^f(x > s; u) = \sum_{i=v+1}^{n} p^f(i; u) = \sum_{i=v+1}^{n} T_n^i(p_1^f, p_2^f, \ldots, p_n^f), \tag{B.64}
\]

where $v = \sum_{i=1}^{n} p_i^f + s$, and $p_i^f = p_{min}$, $\forall i \in U$, $p_i^f = p_{max}$, $\forall i \in V$, and $p_i^f = p_M$, $\forall i \in M$.

Since after each transformation $P(x > s; u)$ can only increase, we have $P^f(x > s; u) \ge P(x > s; u)$.

Let $n_U$, $n_V$, and $n_M$ be the number of flows in sets $U$, $V$, and $M$, respectively. Define integers
$v_U$, $v_V$, and $v_M$, such that $v = v_U + v_V + v_M$, and $v_U \le n_U$, $v_V \le n_V$, and $v_M \le n_M$, respectively.

Then, it can be shown that

\[
P^f(x > s; u) \le P_U + P_V + P_M, \tag{B.65}
\]

where

\[
\begin{aligned}
P_U &= \sum_{i=v_U+1}^{n_U} \binom{n_U}{i} p_{min}^{i} (1 - p_{min})^{n_U - i}, \\
P_V &= \sum_{i=v_V+1}^{n_V} \binom{n_V}{i} p_{max}^{i} (1 - p_{max})^{n_V - i}, \\
P_M &= \sum_{i=v_M+1}^{n_M} \binom{n_M}{i} p_{M}^{i} (1 - p_{M})^{n_M - i}.
\end{aligned} \tag{B.66}
\]

Due to the complexity of the notation we omit the derivation of Eq. (B.65). Instead, below we give an
alternate method that achieves the same result.

The key observation is that $P_U$ represents the probability with which more than $\sum_{i \in U} \lfloor u \cdot r_i \rfloor + v_U$
packets from flows in $U$ arrive during the interval $[t_0, t_0 + u)$. This is easy to see, as the probability
that exactly $\sum_{i \in U} \lfloor u \cdot r_i \rfloor + m$ packets from flows in $U$ arrive during $[t_0, t_0 + u)$ is
$\binom{n_U}{m} p_{min}^{m} (1 - p_{min})^{n_U - m}$ (see Eq. (B.47) for comparison).

Similarly, $P_V$ is the probability that more than $\sum_{i \in V} \lfloor u \cdot r_i \rfloor + v_V$ packets from flows in $V$ arrive
during $[t_0, t_0 + u)$, while $P_M$ is the probability that more than $\sum_{i \in M} \lfloor u \cdot r_i \rfloor + v_M$ packets from
flows in $M$ arrive during the same interval.

Consequently, $(1 - P_U)(1 - P_V)(1 - P_M)$ represents the probability with which no more than
$\sum_{i \in U} \lfloor u \cdot r_i \rfloor + v_U$, $\sum_{i \in V} \lfloor u \cdot r_i \rfloor + v_V$, and $\sum_{i \in M} \lfloor u \cdot r_i \rfloor + v_M$ packets are received from flows in $U$,
$V$, and $M$ during $[t_0, t_0 + u)$. Clearly this probability is no larger than the probability of receiving
no more than $\sum_{i=1}^{n} \lfloor u \cdot r_i \rfloor + v$ packets from all flows during the interval $[t_0, t_0 + u)$, a probability
which is exactly $1 - P^f(x > s; u)$. This yields

\[
\begin{aligned}
1 - P^f(x > s; u) &\ge (1 - P_U)(1 - P_V)(1 - P_M) \;\Rightarrow \\
P^f(x > s; u) &\le 1 - (1 - P_U)(1 - P_V)(1 - P_M) < P_U + P_V + P_M.
\end{aligned} \tag{B.67}
\]

Next, consider the expression of $P_U$ in Eq. (B.66). Let

\[
s_U = v_U - u_U, \tag{B.68}
\]

where $u_U = p_{min} n_U$. Then it is easy to see that the expressions of $p_{min}$ (i.e., $p_{min} = u_U / n_U$)
and $P_U$, given by Eq. (B.66), are identical to the expressions of $p(d)$ and $P(x > s; u)$, given by
Eqs. (B.26) and (B.28), respectively, after the following substitutions: $d \leftarrow u_U$, $n \leftarrow n_U$, $u \leftarrow u_U$,
$s \leftarrow s_U$. By applying the result of Lemma 8 we have the following bound


Pu 


< 


nu  /  \ 

( i  ~ 

,  /T”  n  -  (st;  -  l)/2n{/y*‘'  {nu+suf' 
2n\l  +  {su-  l)/2nu)  4sunu 


(B.69) 


Next  we  compute  su,  such  that 

1  -  R(  \  n~  ~  jnu  +  su)'^ 

27r\l  +  {su-l)/2nuJ  4sunu 


(B.70) 


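By applying the same approximations used in proving Lemma 9 (see Eq. (B.41)), i.e., $s_U \ll n_U$,
$s_V \ll n_V$, and $s_M \ll n_M$, respectively, we get

\[
s_U \approx \sqrt{n_U\left(\frac{\ln n_U}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)}, \tag{B.71}
\]

and similarly

\[
s_V \approx \sqrt{n_V\left(\frac{\ln n_V}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)}, \qquad
s_M \approx \sqrt{n_M\left(\frac{\ln n_M}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)}. \tag{B.72}
\]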


By using the above values for $s_U$, $s_V$, and $s_M$, respectively, and by the definition of
$P^f(x > s; u)$ and Eq. (B.65), we have

\[
P(x > s; u) \le P^f(x > s; u) \le P_U + P_V + P_M \le 3 \cdot \frac{\varepsilon}{3} = \varepsilon. \tag{B.73}
\]


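Now it remains to compute $s$. First, recall that $s_U = v_U - u_U$, $s_V = v_V - u_V$, $s_M = v_M - u_M$,
where $u_U = p_{min} n_U$, $u_V = p_{max} n_V$, and $u_M = p_M n_M$ (see Eq. (B.68)). From here we obtain

\[
\begin{aligned}
s_U + s_V + s_M &= (v_U - u_U) + (v_V - u_V) + (v_M - u_M) \\
                &= v - u_U - u_V - u_M \\
                &= v - p_{min} n_U - p_{max} n_V - p_M n_M \\
                &= v - \sum_{i \in U} p_{min} - \sum_{i \in V} p_{max} - \sum_{i \in M} p_M \\
                &= v - \sum_{i=1}^{n} p_i^f = s.
\end{aligned} \tag{B.74}
\]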
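
As both $P(x > s; u)$ and $P^f(x > s; u)$ decrease in $s$, for our purpose it is sufficient to determine
an upper bound for $s$. From Eqs. (B.71), (B.72) and (B.74) this reduces to computing

\[
\max\left(
\sqrt{n_U\left(\frac{\ln n_U}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)} +
\sqrt{n_V\left(\frac{\ln n_V}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)} +
\sqrt{n_M\left(\frac{\ln n_M}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)}
\right) \tag{B.75}
\]

subject to $n_U + n_V + n_M = n$. Since the function $\sqrt{x \ln x}$ is concave, it follows that expression
(B.75) achieves its maximum for $n_U = n_V = n_M = n/3$. Finally, we choose

\[
s = 3 \cdot \sqrt{\frac{n}{3}\left(\frac{\ln(n/3)}{2} - \frac{\ln(\varepsilon/3)}{2} - 1\right)}
  = \sqrt{3n\left(\frac{\ln n}{2} - \frac{\ln \varepsilon}{2} - 1\right)}, \tag{B.76}
\]

which completes the proof. □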

By combining Lemmas 9 and 12 we have the following result.


Theorem 3 Consider a server traversed by $n$ flows. Assume that the arrival times of the packets
from different flows are independent, and that all packets have the same size. Then, for any
given probability $\varepsilon$, the queue size at any time instant during a server busy period is asymptotically
bounded above by $s$, where

\[
s = \sqrt{\beta n\left(\frac{\ln n}{2} - \frac{\ln \varepsilon}{2} - 1\right)}, \tag{B.77}
\]

with a probability larger than $1 - \varepsilon$. For identical reservations $\beta = 1$; for heterogeneous
reservations $\beta = 3$.
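
As a rough numerical illustration of Theorem 3, the Python sketch below evaluates the buffer bound for both values of $\beta$. It assumes the bound has the form $s = \sqrt{\beta n (\ln n / 2 - \ln \varepsilon / 2 - 1)}$, which is the form consistent with Corollary 1 and the proof of Lemma 12. The function name and the example values of $n$ and $\varepsilon$ are hypothetical.

```python
import math


def buffer_bound(n, eps, beta=1.0):
    """Asymptotic buffer size s (in packets) for n flows and overflow probability eps.

    Assumed form: s = sqrt(beta * n * (ln n / 2 - ln eps / 2 - 1)).
    beta = 1 for identical reservations, beta = 3 for heterogeneous ones.
    """
    inner = math.log(n) / 2.0 - math.log(eps) / 2.0 - 1.0
    if inner <= 0.0:
        raise ValueError("bound is only meaningful when ln(n/eps) > 2")
    return math.sqrt(beta * n * inner)


# Hypothetical example: 1000 flows, overflow probability 1e-6.
s_identical = buffer_bound(1000, 1e-6, beta=1.0)
s_heterogeneous = buffer_bound(1000, 1e-6, beta=3.0)
print(round(s_identical, 1), round(s_heterogeneous, 1))
```

The bound grows as $\sqrt{n \ln n}$ rather than linearly in $n$, and moving from identical to heterogeneous reservations only scales it by $\sqrt{3}$.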


B.4  Proof  of  Theorem  4 


Theorem 4 Consider a link of capacity $C$ at time $t$. Assume that no reservation terminates and
there are no reservation failures or request losses after time $t$. Then if there is sufficient demand
after $t$ the link utilization approaches asymptotically $C(1 - f)/(1 + f)$.


Proof. If the aggregate reservation at time $t$ is larger than $C(1 - f)/(1 + f)$, the proof is trivially
true. Next, we consider the case in which the aggregate reservation is less than $C(1 - f)/(1 + f)$.

In particular, let $C(1 - f)/(1 + f) - \Delta$ be the aggregate reservation at time $t$. Without loss
of generality assume $t = u_k$. Then we will show that if no reservation terminates, no reservation
request fails, and there is enough demand after time $u_k$, then at least $(1 + f)\Delta/2$ bandwidth is
allocated during the next two slots, i.e., during the interval $(u_k, u_{k+2}]$. Thus, for any arbitrarily small
real $\varepsilon$, we are guaranteed that after, at most,

\[
\frac{\ln(\varepsilon / \Delta)}{\ln((1 - f)/2)} \tag{B.78}
\]

slots the aggregate reservation will exceed $C(1 - f)/(1 + f) - \varepsilon$.
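
The expression in Eq. (B.78) can be read as the number of contraction steps needed to shrink the gap $\Delta$ below $\varepsilon$ when each step removes at least a fraction $(1 + f)/2$ of the remaining gap, so that the gap is multiplied by $(1 - f)/2$. The Python sketch below compares the closed form with a direct iteration. The interpretation of a step as one application of the two-slot argument, the function names, and the numeric values are assumptions made for this illustration.

```python
import math


def contraction_steps(delta, eps, f):
    """ln(eps/delta) / ln((1 - f)/2), as in Eq. (B.78), rounded up to an integer."""
    if delta <= eps:
        return 0
    return math.ceil(math.log(eps / delta) / math.log((1.0 - f) / 2.0))


def simulate_gap(delta, eps, f):
    """Count steps until the gap drops to eps when each step allocates (1 + f) * gap / 2."""
    steps = 0
    while delta > eps:
        delta -= (1.0 + f) * delta / 2.0   # worst case: exactly the guaranteed amount
        steps += 1
    return steps


# Hypothetical values: initial gap 100 units, target gap 0.1 units, f = 0.1.
closed_form = contraction_steps(100.0, 0.1, 0.1)
simulated = simulate_gap(100.0, 0.1, 0.1)
print(closed_form, simulated)
```

Since the gap shrinks geometrically, the step count grows only logarithmically in $\Delta / \varepsilon$.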

From Eq. (5.20) it follows that the maximum capacity which can be allocated during the interval
$(u_k, u_{k+1}]$ is $\max(C - R_{cal}(u_k), 0)$. Assume then that $\Delta_1$ capacity is allocated during $(u_k, u_{k+1}]$,
where $\Delta_1 \le \max(C - R_{cal}(u_k), 0)$. Consider two cases, depending on whether $\Delta_1 \ge \Delta$ or not.
If $\Delta_1 \ge \Delta$, the proof follows trivially.

Figure B.3: The scenario in which the upper bound of $b_i$, i.e., $r_i(T_W + T_I + T_J)$, is achieved. The arrows
represent packet transmissions. $T_W$ is the averaging window size; $T_I$ is an upper bound on the packet
inter-departure time; $T_J$ is an upper bound on the delay jitter. Both m1 and m2 fall just inside the estimation
interval, $T_W$, at the core node.

Assume $\Delta_1 < \Delta$. Then we will show that at time $u_{k+2}$ the aggregate reservation can increase by
at least a constant fraction of $\Delta$. From Figure B.3 it is easy to see that, for any reservation continuously
active during an interval $(u_k, u_{k+1}]$, we have

\[
b_i(u_k, u_{k+1}) \le r_i (T_W + T_I + T_J). \tag{B.79}
\]

Since no reservation terminates during $(u_k, u_{k+1}]$, we have $\mathcal{L}(u_{k+1}) = \mathcal{L}(u_k) \cup \mathcal{N}(u_{k+1})$. Let
$ac_i \in (u_k, u_{k+1}]$ be the time when flow $i$ becomes active during $(u_k, u_{k+1}]$. Since
$b_i(ac_i, u_{k+1}) \le b_i(u_k, u_{k+1})$, by using Eq. (B.79), we obtain

\[
B(u_k, u_{k+1}) = \sum_{i \in \mathcal{L}(u_{k+1})} b_i(u_k, u_{k+1}) \le \sum_{i \in \mathcal{L}(u_{k+1})} r_i (T_W + T_I + T_J). \tag{B.80}
\]

From here we get

\[
R_{DPS}(u_k, u_{k+1}) \le R(u_{k+1})(1 + f). \tag{B.81}
\]

Since there are no duplicate requests or partial reservation failures after time $t = u_k$, we have
$\Delta_1 = R_{new}(u_{k+1})$. From here and from Eq. (5.20) and Eq. (B.81) we have

\[
R_{cal}(u_{k+1}) \le \frac{R_{DPS}(u_k, u_{k+1})}{1 - f} + \Delta_1 \le R(u_{k+1}) \frac{1 + f}{1 - f} + \Delta_1. \tag{B.82}
\]

In addition, we have $R(u_{k+1}) = R(u_k) + \Delta_1$. Since $R(u_k) = C(1 - f)/(1 + f) - \Delta$, from
Eq. (B.82), it follows that

\[
C - R_{cal}(u_{k+1}) \ge C - R(u_{k+1}) \frac{1 + f}{1 - f} - \Delta_1 \ge \frac{1 + f}{1 - f} \Delta - \frac{2}{1 - f} \Delta_1. \tag{B.83}
\]

Finally, consider two cases, depending on whether (a) $\Delta_1 < \Delta(1 + f)/2$, or (b) not. If (a) is true then the link
can allocate up to

\[
\Delta_1 + C - R_{cal}(u_{k+1}) \ge \Delta_1 + \frac{1 + f}{1 - f} \Delta - \frac{2}{1 - f} \Delta_1 = \frac{1 + f}{1 - f} (\Delta - \Delta_1) > \frac{(1 + f)\Delta}{2} \tag{B.84}
\]

capacity during the time interval $(u_k, u_{k+2}]$. In case (b) we have trivially $\Delta_1 \ge \Delta(1 + f)/2$. Thus
in both cases we can allocate at least $\Delta(1 + f)/2$ new capacity during $(u_k, u_{k+2}]$. □

